Ling 202: session 02: linear regr. 1 (key)

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

02 Apr 2025 12-34-56

1 Intro

Let’s load a data set for a few linear modeling examples; the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r.

rm(list=ls(all.names=TRUE))
library(car); library(effects); library(emmeans)
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE            RT             LENGTH         LANGUAGE         GROUP
 Min.   :   1   Min.   : 271.0   Min.   :3.000   english:4023   english :3961
 1st Qu.:2150   1st Qu.: 505.0   1st Qu.:4.000   spanish:3977   heritage:4039
 Median :4310   Median : 595.0   Median :5.000
 Mean   :4303   Mean   : 661.5   Mean   :5.198
 3rd Qu.:6450   3rd Qu.: 732.0   3rd Qu.:6.000
 Max.   :8610   Max.   :4130.0   Max.   :9.000
                NA's   :248
      CORRECT        FREQPMW           ZIPFFREQ     SITE
 correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
 incorrect: 251   1st Qu.:  17.00   1st Qu.:4.230   b:2815
                  Median :  42.00   Median :4.623   c:2782
                  Mean   :  81.14   Mean   :4.591
                  3rd Qu.: 101.00   3rd Qu.:5.004
                  Max.   :1152.00   Max.   :6.061
                                                            

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach. But first we change one factor in the data (for purely didactic reasons) and get rid of the missing data:

d$LANGUAGE <- factor(d$LANGUAGE,   # for didactic reasons only, I am
   levels=levels(d$LANGUAGE)[2:1]) # changing the order of the levels
summary(d <- droplevels(d[complete.cases(d),]))
      CASE            RT             LENGTH       LANGUAGE         GROUP
 Min.   :   1   Min.   : 271.0   Min.   :3.0   spanish:3775   english :3793
 1st Qu.:2123   1st Qu.: 505.0   1st Qu.:4.0   english:3977   heritage:3959
 Median :4262   Median : 595.0   Median :5.0
 Mean   :4268   Mean   : 661.5   Mean   :5.2
 3rd Qu.:6405   3rd Qu.: 732.0   3rd Qu.:6.0
 Max.   :8610   Max.   :4130.0   Max.   :9.0
      CORRECT        FREQPMW           ZIPFFREQ     SITE
 correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
 incorrect:   3   1st Qu.:  18.00   1st Qu.:4.255   b:2815
                  Median :  43.00   Median :4.633   c:2534
                  Mean   :  82.88   Mean   :4.609
                  3rd Qu.: 104.00   3rd Qu.:5.017
                  Max.   :1152.00   Max.   :6.061           

2 Deviance

A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but the response variable’s values are not all the mean, they vary around it:

summary(d$RT)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  271.0   505.0   595.0   661.5   732.0  4130.0 

One way to express the degree to which the values of RT vary around the mean would be the average absolute deviation:

mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
[1] 172.5416

But as you know, this is not how we usually quantify deviations from the mean: As in the standard deviation, we usually square the differences between the observed values and the mean and then add them up to get the so-called deviance:

sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
[1] 536471586

This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:

summary(m_00 <- lm(       # make/summarize the linear model m_00:
   RT ~ 1,                # RT ~ an overall intercept (1) & no predictor
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-390.5 -156.5  -66.5   70.5 3468.5

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  661.469      2.988   221.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 263.1 on 7751 degrees of freedom

That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:

deviance(m_00)
[1] 536471586

Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:

t.test(d$RT)

    One Sample t-test

data:  d$RT
t = 221.37, df = 7751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 655.6111 667.3259
sample estimates:
mean of x
 661.4685 

And this is how we’d get the confidence intervals that t.test provides from the regression model:

Confint(m_00)
            Estimate    2.5 %   97.5 %
(Intercept) 661.4685 655.6111 667.3259

And the manual computation would be this:

# lower bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.025, df=7751)        # the t-value for 2.5% at that df
(Intercept)
   655.6111 
# upper bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.975, df=7751)        # the t-value for 2.5% at that df
(Intercept)
   667.3259 

Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …

3 A numeric predictor

Does the reaction time to a word (in ms) vary as a function of the frequency of that word?

3.1 Exploration & preparation

How would we want to explore the response variable?

hist(d$RT); hist(d$RT, breaks="FD")

We can’t really work with that. What if we log this?

hist(log2(d$RT)); hist(log2(d$RT), breaks="FD")

Much better, that’s what we need! We add the logged version to the data:

d$RT_log <- log2(d$RT)

That means we need to compute the deviance again!

summary(m_00 <- lm(       # make/summarize the linear model m_00:
   RT_log ~ 1,            # RT_log ~ an overall intercept (1) & no predictor
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT_log ~ 1, data = d, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.20705 -0.30906 -0.07245  0.22650  2.72273

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.289201   0.005131    1810   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4518 on 7751 degrees of freedom
deviance(m_00)
[1] 1581.951

Now the predictor:

hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")

plot(
   main="RT (in ms, logged) as a function of frequency per million words",
   pch=16, col="#00000010",
   xlab="Frequency pmw"     , xlim=c(0, 1200), x=d$FREQPMW,
   ylab="RT (in ms, logged)", ylim=c(8, 12  ), y=d$RT_log); grid()
abline(lm(d$RT_log ~ d$FREQPMW), col="red", lwd=2)

Again, this doesn’t look good. Thus, we ask, does the reaction time to a word (in ms) vary as a function of a logged version of the frequency of that word? For that we use the already provided variable ZIPFFREQ so let’s explore the new predictor:

# the new predictor(s) on its/their own
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")

# the predictor(s) w/ the response
plot(
   main="RT (in ms, logged) as a function of Zipf frequency",
   pch=16, col="#00000010",
   xlab="Zipf frequency"    , xlim=c(3,  6), x=d$ZIPFFREQ,
   ylab="RT (in ms, logged)", ylim=c(8, 12), y=d$RT_log); grid()
abline(lm(d$RT_log ~ d$ZIPFFREQ), col="red", lwd=2)

There we go …

3.2 Modeling & numerical interpretation

How would we have dealt with this based on Chapter 4 alone? We would have computed a Pearson product-moment correlation coefficient r (or spearman’s ρ, either with cor.test) and maybe a simple linear model (with lm). Thus, we might have done something like this (for Pearson’s r) …

cor.test(d$RT_log, d$ZIPFFREQ)

    Pearson's product-moment correlation

data:  d$RT_log and d$ZIPFFREQ
t = -12.939, df = 7750, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1671408 -0.1235589
sample estimates:
       cor
-0.1454204 

… or this for the simplest possible linear regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT_log ~ 1 +           # RT_log ~ an overall intercept (1)
   ZIPFFREQ,              # & the predictor ZIPFFREQ
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ, data = d, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.21421 -0.30499 -0.07338  0.22546  2.75795

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.84212    0.04303  228.72   <2e-16 ***
ZIPFFREQ    -0.11995    0.00927  -12.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared:  0.02115,   Adjusted R-squared:  0.02102
F-statistic: 167.4 on 1 and 7750 DF,  p-value: < 2.2e-16

Note the equivalence of the classical Pearson’s r and the linear model we fit:

  • the t-statistic from Pearson’s r (-12.9394984) squared is m_01’s F-value (167.4306191);
  • the t-statistic from Pearson’s r (-12.9394984) is also the t-statistic of ZIPFFREQ (-12.9394984);
  • the df of Pearson’s r (7750) is also the df2 of m_01 (7750);
  • the value of Pearson’s r (-0.1454204) squared is m_01’s Multiple R-squared (0.0211471).

What is the confidence interval of the coefficients?

Confint(m_01)
              Estimate      2.5 %     97.5 %
(Intercept)  9.8421186  9.7577652  9.9264721
ZIPFFREQ    -0.1199524 -0.1381246 -0.1017802

And the manual computation would be this:

# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   9.757765 
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   9.926472 
# lower bound for the slope:
coef(m_01)[2] +              # the slope plus
   summary(m_01)$coef[2,2] * # the slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
  ZIPFFREQ
-0.1381246 
# upper bound for the slope:
coef(m_01)[2] +              # the slope plus
   summary(m_01)$coef[2,2] * # the slope's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
  ZIPFFREQ
-0.1017802 

What is the deviance of our model with its one predictor?

deviance(m_01)
[1] 1548.498

Wow, that doesn’t seem to have reduced the deviance much from that of the null model, which was 1581.9514799. How much is the reduction in %?

"/"(
   deviance(m_00)-deviance(m_01),
   deviance(m_00))
[1] 0.02114709

Does that number look familiar?

Let’s now also add the predictions of the model to the original data frame (usually a good (house-keeping) habit to adopt):

d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d) # check the result
  CASE   RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE    RT_log
1    1  748      5  spanish heritage correct      19 4.278754    b  9.546894
2    2  408      5  spanish heritage correct      78 4.892095    b  8.672425
3    3  483      6  spanish heritage correct     233 5.367356    b  8.915879
4    4 1266      4  spanish heritage correct     133 5.123852    b 10.306062
5    5  636      6  spanish heritage correct      14 4.146128    b  9.312883
6    6  395      6  spanish heritage correct      67 4.826075    b  8.625709
      PRED
1 9.328872
2 9.255300
3 9.198291
4 9.227500
5 9.344780
6 9.263219

3.3 Visual interpretation

Let’s visualize the nature of the effect of ZIPFFREQ based on the predictions:

(zpf_d <- data.frame( # make zpf_d a data frame of
   zpf <- effect(     # the effect
      "ZIPFFREQ",     # of ZIPFFREQ
      m_01)))         # in m_01
  ZIPFFREQ      fit          se    lower    upper
1      3.0 9.482261 0.015760356 9.451367 9.513156
2      3.8 9.386299 0.009060110 9.368539 9.404060
3      4.5 9.302333 0.005177331 9.292184 9.312482
4      5.3 9.206371 0.008170179 9.190355 9.222386
5      6.1 9.110409 0.014720701 9.081552 9.139265
plot(zpf,                     # plot the effect of ZIPFFREQ
   xlab="Zipf frequency",     # w/ this x-axis label
   ylab="RT (in ms, logged)", # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                 # & w/ a grid

And here a version with everything: observed values, predicted values, and the confidence band of the predictions/regression line:

plot(main="RT (in ms, logged) as a function of Zipf frequency",
   pch=16, col="#00000010",
   xlab="Zipf frequency"    , xlim=c(3,  6), x=d$ZIPFFREQ,
   ylab="RT (in ms, logged)", ylim=c(8, 12), y=d$RT_log); grid()
axis(4, at=axTicks(2), labels=round(2^axTicks(2), 1))
# add the regression line:
abline(m_01, col="red", lwd=2)
# add the confidence band:
polygon(
   c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),
   c(zpf_d$lower, rev(zpf_d$upper)),
   col="#FF000030", border=NA)

3.4 Write-up

We wanted to determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words). However, preliminary exploration indicated that both the response RT and the predictor FREQPMW were extremely skewed with high outliers, which is why

  • the response variable was transformed with the binary log;
  • the predictor FREQPMW was replaced by the (inherently log-scaled) predictor ZIPFFREQ.

Then, a linear model was fit with RT_log as the response variable and ZIPFFREQ as a predictor. The model was highly significant (F=167.4, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.021, adj. R2=0.021). The coefficients indicate that ZIPFFREQ is negatively correlated with RT_log, as shown in the summary table and the figure shown here. [Show a plot]

Estimate 95%-CI se t p2-tailed
Intercept (ZIPFFREQ=0) 9.842 [9.758, 9.926] 0.043 228.72 <0.001
ZIPFFREQ0→1 -0.12 [-0.138, -0.102] 0.009 -12.94 <0.001
# housekeeping: remove the predictions
str(d <- d[,1:10])
'data.frame':   7752 obs. of  10 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
 $ RT_log  : num  9.55 8.67 8.92 10.31 9.31 ...

4 A binary predictor

Does the reaction time to a word (in ms) vary as a function of which language that word was presented in (Spanish vs. English)?

4.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT_log); hist(d$RT_log, breaks="FD") # we already did that anyway

table(d$LANGUAGE)

spanish english
   3775    3977 
# the predictor(s) w/ the response
qwe <- boxplot(main="RT (in ms, logged) as a function of stimulus language",
   d$RT_log ~ d$LANGUAGE, notch=TRUE,
   xlab="Stimulus language",
   ylab="RT (in ms, logged)"); grid()

4.2 Modeling & numerical interpretation

How would we have dealt with this based on Chapter 4 alone? We would have computed a t-test for independent samples (with t.test) like this:

t.test(d$RT_log ~ d$LANGUAGE, var.equal=TRUE) # the default t-test (Student)

    Two Sample t-test

data:  d$RT_log by d$LANGUAGE
t = 27.155, df = 7750, p-value < 2.2e-16
alternative hypothesis: true difference in means between group spanish and group english is not equal to 0
95 percent confidence interval:
 0.2471709 0.2856323
sample estimates:
mean in group spanish mean in group english
             9.425872              9.159471 

This t-test is actually also just a very simple linear regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT_log ~ 1 +           # RT_log ~ an overall intercept (1)
   LANGUAGE,              # & the predictor LANGUAGE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT_log ~ 1 + LANGUAGE, data = d, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18270 -0.28696 -0.06671  0.21618  2.58605

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      9.425872   0.007027 1341.44   <2e-16 ***
LANGUAGEenglish -0.266402   0.009810  -27.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4317 on 7750 degrees of freedom
Multiple R-squared:  0.08688,   Adjusted R-squared:  0.08677
F-statistic: 737.4 on 1 and 7750 DF,  p-value: < 2.2e-16

Note the equivalence of the classical t-test and the linear model we fit (with the exception of the direction of the differences):

  • the t-statistic from the t-test (27.1554785) squared is m_01’s F-value (737.4200137);
  • the t-statistic from the t-test (27.1554785) is also the (negative of the) t-statistic of LANGUAGEspanish (-27.1554785);
  • the df of the t-test (7750) is also the df2 of m_01 (7750);
  • the first mean reported by the t-test (mean in group spanish, of 9.4258724) is the intercept of m_01 (of 9.4258724);
  • the second mean reported by the t-test (mean in group english, of 9.1594708) is the sum of intercept of m_01 (of 9.4258724) and the mean difference/slope of m_01 (of -0.2664016);
  • the confidence interval of the t-test (0.2471709, 0.2856323) is the (negative) coefficient for LANGUAGEenglish (of -0.2664016) plus/minus the product of
    • its standard error (of 0.0098102) and
    • the t-value that cuts off 2.5% of the area under the curve for df=7750 (qt(p=0.025, df=7750), i.e. -1.9602701).

This CI computation for the linear model is shown here, first manually again and then the easy way with Confint:

# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   9.412098 
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   9.439647 
# lower bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
LANGUAGEenglish
     -0.2856323 
# upper bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
LANGUAGEenglish
     -0.2471709 
Confint(m_01)
                  Estimate      2.5 %     97.5 %
(Intercept)      9.4258724  9.4120982  9.4396466
LANGUAGEenglish -0.2664016 -0.2856323 -0.2471709

Back to the model:

What is the deviance of our model with its one predictor?

deviance(m_01)
[1] 1444.505

This seems to have reduced the deviance somewhat from that of the null model, which was 1581.9514799 – again, how much is the reduction in %?

"/"(
   deviance(m_00)-deviance(m_01),
   deviance(m_00))
[1] 0.08688388

Which I am again hoping is familiar by now …

We again add the predictions of the model to the original data frame:

d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d, 3) # check ...
  CASE  RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE   RT_log
1    1 748      5  spanish heritage correct      19 4.278754    b 9.546894
2    2 408      5  spanish heritage correct      78 4.892095    b 8.672425
3    3 483      6  spanish heritage correct     233 5.367356    b 8.915879
      PRED
1 9.425872
2 9.425872
3 9.425872
tail(d, 3) # the result
     CASE  RT LENGTH LANGUAGE   GROUP CORRECT FREQPMW ZIPFFREQ SITE   RT_log
7998 8608 947      5  english english correct      80 4.903090    a 9.887221
7999 8609 660      4  english english correct     746 5.872739    a 9.366322
8000 8610 437      5  english english correct      54 4.732394    c 8.771489
         PRED
7998 9.159471
7999 9.159471
8000 9.159471

4.3 Visual interpretation

Let’s visualize the nature of the effect of LANGUAGE based on the predictions:

(lan_d <- data.frame( # make lan_d a data frame of
   lan <- effect(     # the effect
      "LANGUAGE",     # of LANGUAGE
      m_01)))         # in m_01
  LANGUAGE      fit          se    lower    upper
1  spanish 9.425872 0.007026681 9.412098 9.439647
2  english 9.159471 0.006845906 9.146051 9.172891
plot(lan,             # plot the effect of LANGUAGE
   xlab="Language",           # w/ this x-axis label
   ylab="RT (in ms, logged)", # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                 # & w/ a grid

And here a version with everything: observed values, predicted values, the confidence intervals of the predictions, and a horizontal decile grid:

stripchart(main="RT_log ~ LANGUAGE", # stripchart w/ main heading
   pch=16, col="#00000010",          # filled semi-transparent grey circles
   d$RT_log ~ d$LANGUAGE,            # data: RT_log ~ LANGUAGE
   xlab="Language", ylab="RT in ms", # axis labels
   vertical=TRUE,                    # make the chart have a vertical y-axis
   method="jitter")                  # jitter observations
# the next lne creates a horizontal decile grid
abline(h=quantile(d$RT_log, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(lan_d$fit)) {                           # their confidence intervals
   arrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")
}

4.4 Write-up

To determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (english vs. spanish), a linear model was fit with RT as the response variable and LANGUAGE as a predictor. The model was highly significant (F=737.4, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.087, adj. R2=0.087). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot]

Estimate 95%-CI se t p2-tailed
Intercept (LANGUAGE=english) 9.426 [9.412, 9.44] 0.007 1341.44 <0.001
LANGUAGEenglishspanish -0.266 [-0.286, -0.247] 0.01 -27.16 <0.001
# housekeeping: remove the predictions
str(d <- d[,1:10])
'data.frame':   7752 obs. of  10 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
 $ RT_log  : num  9.55 8.67 8.92 10.31 9.31 ...

5 A categorical predictor

Does the reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (a vs. b vs. c)?

5.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT_log); hist(d$RT_log, breaks="FD") # we already did that anyway

table(d$SITE)

   a    b    c
2403 2815 2534 
# the predictor(s) w/ the response
qwe <- boxplot(main="RT (in ms, logged) ~  experimental site",
   d$RT_log ~ d$SITE, notch=TRUE,
   xlab="Experimental site",
   ylab="RT in (in ms, logged)"); grid()

It is noteworthy that site c differs considerably from the others in the outlier distribution; this would definitely require some attention! [In fact, we will see something more interesting later.]

5.2 Modeling & numerical interpretation

Let’s fit our regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT_log ~ 1 +           # RT_log ~ an overall intercept (1)
   SITE,                  # & the predictor SITE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT_log ~ 1 + SITE, data = d, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.10891 -0.30613 -0.08053  0.23524  2.65471

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)
(Intercept) 9.191055   0.009111 1008.810   <2e-16 ***
SITEb       0.120700   0.012404    9.731   <2e-16 ***
SITEc       0.166161   0.012717   13.066   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4466 on 7749 degrees of freedom
Multiple R-squared:  0.02295,   Adjusted R-squared:  0.0227
F-statistic:    91 on 2 and 7749 DF,  p-value: < 2.2e-16

This CI computation for the linear model is shown here, first manually again and then the easy way with Confint:

# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
(Intercept)
   9.173196 
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
(Intercept)
   9.208915 
# lower bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
     SITEb
0.09638472 
# upper bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
    SITEb
0.1450159 
# lower bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
    SITEc
0.1412324 
# upper bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
    SITEc
0.1910899 
Confint(m_01)
             Estimate      2.5 %    97.5 %
(Intercept) 9.1910552 9.17319560 9.2089148
SITEb       0.1207003 0.09638472 0.1450159
SITEc       0.1661612 0.14123245 0.1910899

What is the deviance of our model with its one predictor (but 2 contrasts from the intercept)?

deviance(m_01)
[1] 1545.65

Again pretty bad – how much is the reduction in %?

"/"(
   deviance(m_00)-deviance(m_01),
   deviance(m_00))
[1] 0.02294739

As usual, we add the predictions of the model to the original data frame:

# replace this with your answers/code
head(d, 3) # check ...
  CASE  RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE   RT_log
1    1 748      5  spanish heritage correct      19 4.278754    b 9.546894
2    2 408      5  spanish heritage correct      78 4.892095    b 8.672425
3    3 483      6  spanish heritage correct     233 5.367356    b 8.915879
tail(d, 3) # the result
     CASE  RT LENGTH LANGUAGE   GROUP CORRECT FREQPMW ZIPFFREQ SITE   RT_log
7998 8608 947      5  english english correct      80 4.903090    a 9.887221
7999 8609 660      4  english english correct     746 5.872739    a 9.366322
8000 8610 437      5  english english correct      54 4.732394    c 8.771489

5.3 Visual interpretation

Let’s visualize the nature of the effect of SITE based on the predictions:

(sit_d <- data.frame( # make sit_d a data frame of
   sit <- effect   (  # the effect
      "SITE",         # of SITE
      m_01)))         # in m_01
  SITE      fit          se    lower    upper
1    a 9.191055 0.009110786 9.173196 9.208915
2    b 9.311756 0.008417702 9.295255 9.328256
3    c 9.357216 0.008872161 9.339825 9.374608
plot(sit,                     # plot the effect of SITE
   xlab="Site",               # w/ this x-axis label
   ylab="RT (in ms, logged)", # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                 # & w/ a grid

And here a version with everything: observed values, predicted values, the confidence intervals of the predictions, and a horizontal decile grid:

stripchart(main="RT_log ~ SITE", # stripchart w/ main heading
   pch=16, col="#00000010",      # filled semi-transparent grey circles
   d$RT_log ~ d$SITE,            # data: RT_log ~ SITE
   xlab="RT (in ms, logged)", ylab="Site", # axis labels
   vertical=TRUE,                # make the chart have a vertical y-axis
   method="jitter")              # jitter observations
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT_log, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(sit_d$fit)) { # their confidence intervals
   arrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")
}

Excursus: This is an interesting case because only the very last plot shows something that we haven’t noticed before, namely the fact that site c has a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602], check out the arrow in the plot below); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:

plot(main="RT (in ms, logged) ~ experimental site", type="n",
   xlab="RT (in ms, logged)"   , xlim=c(7.75, 12.25), x=0,
   ylab="Cumulative proportion", ylim=c(0   ,  1   ), y=0); grid()
for (i in 1:3) {            # for each of the 3 sites
   lines(pch=".",           # plot lines w/ this point character
      # namely ecdf lines using these colors
      ecdf(d$RT_log[d$SITE==levels(d$SITE)[i]]), col=rainbow(3)[i]) #
}
legend(title="Site",      # add a legend with the title "Site"
   x=10.5, xjust=0.5,     # at these x-coordinates (centered)
   y= 0.5, yjust=0.5,     # at these y-coordinates (centered)
   legend=levels(d$SITE), # these are the labels, show them ...
   fill=rainbow(3),       # w/ these colors
   ncol=3, bty="n")       # organize the legend in 3 columns, no box
arrows(11, 0.7, 10.2, 0.9, angle=15, len=0.2)

5.4 Excursus: model comparison

In cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for SITE very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let’s discuss briefly how this can be tested with a method called model comparison: If you have two models that were fit to the same data set and where one model is a sub-model of the other model – typically such that they differ only by one predictor/term – then you can use the function anova to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of SITE are already in the summary table:

summary(m_01)$coefficients
             Estimate  Std. Error     t value     Pr(>|t|)
(Intercept) 9.1910552 0.009110786 1008.810364 0.000000e+00
SITEb       0.1207003 0.012404198    9.730601 2.992632e-22
SITEc       0.1661612 0.012716983   13.066084 1.313862e-38

From this, we know that SITE: a is significantly different from SITE: b (difference=0.1207003, p=2.992632e-22) and SITE: c (difference=0.1661612, p=1.313862e-38), but we don’t know whether the 0.1661612-0.1207003=0.0454609 difference between SITE: b and SITE: c is significant as well. To find that out, we can compare m_01 to a model in which SITE: b and SITE: c are not distinguished. For that, we first, we create a version of SITE that conflates levels as desired:

d$SITE_confl <- d$SITE # create a copy of SITE
levels(d$SITE_confl) <- # overwrite the copy's levels such that
   c("a",               # "a" remains "a"
     "b/c", "b/c")      # "b" & "c" become "b/c"
table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it right
    SITE_confl
SITE    a  b/c
   a 2403    0
   b    0 2815
   c    0 2534
head(d)
  CASE   RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE    RT_log
1    1  748      5  spanish heritage correct      19 4.278754    b  9.546894
2    2  408      5  spanish heritage correct      78 4.892095    b  8.672425
3    3  483      6  spanish heritage correct     233 5.367356    b  8.915879
4    4 1266      4  spanish heritage correct     133 5.123852    b 10.306062
5    5  636      6  spanish heritage correct      14 4.146128    b  9.312883
6    6  395      6  spanish heritage correct      67 4.826075    b  8.625709
  SITE_confl
1        b/c
2        b/c
3        b/c
4        b/c
5        b/c
6        b/c

We already have the model with SITE as the only predictor (m_01) so we only need to fit a new model m_01_siteconfl with SITE_confl as the only predictor:

summary(m_01_siteconfl <- lm( # make/summarize the linear model m_01_siteconfl:
   RT_log ~ 1 + SITE_confl,   # RT_log ~ an overall intercept (1) & SITE_confl
   data=d, na.action=na.exclude)) # the other arguments

Call:
lm(formula = RT_log ~ 1 + SITE_confl, data = d, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.10891 -0.30841 -0.08063  0.23555  2.67863

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.191055   0.009118 1007.98   <2e-16 ***
SITE_conflb/c 0.142237   0.010977   12.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared:  0.02121,   Adjusted R-squared:  0.02108
F-statistic: 167.9 on 1 and 7750 DF,  p-value: < 2.2e-16

… and then we do a comparison with anova:

anova(m_01,        # compare the model w/ the 3-level predictor SITE
   m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confl
   test="F")       # using an F-test
Analysis of Variance Table

Model 1: RT_log ~ 1 + SITE
Model 2: RT_log ~ 1 + SITE_confl
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1   7749 1545.7
2   7750 1548.4 -1   -2.7561 13.817 0.0002029 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the two models is highly significant: Giving up the distinction between SITE: b and SITE: c makes the model not just worse, but significantly worse, meaning the finer resolution of SITE in m_01 compared to SITE_confl in m_01_siteconfl makes that model m_01 significantly better (p<0.001). Thus, we now know not only the size of the 3rd comparison (SITE: b vs. c, which was clear from summary(m_01)), but also its p-value.

Another way of getting such results is available from emmeans::emmeans. Here,

  • lines 2 and 3 of the model’s summary table correspond to lines 1 and 2 of the emmeans output;
  • line 3 of the emmeans output corresponds to the result we obtained with the model conflation approach above: the F-value of the model conflation approach is the squared t-value of the emmeans approach:
summary(m_01)$coefficients # repeating what the coefs were
             Estimate  Std. Error     t value     Pr(>|t|)
(Intercept) 9.1910552 0.009110786 1008.810364 0.000000e+00
SITEb       0.1207003 0.012404198    9.730601 2.992632e-22
SITEc       0.1661612 0.012716983   13.066084 1.313862e-38
emmeans(        # the predicted means computed
   object=m_01, # for the model m_01
   ~ SITE)      # namely all contrasts of this predictor
 SITE emmean      SE   df lower.CL upper.CL
 a     9.191 0.00911 7749    9.173    9.209
 b     9.312 0.00842 7749    9.295    9.328
 c     9.357 0.00887 7749    9.340    9.375

Confidence level used: 0.95 
pairs(             # compute pairwise comparisons
    emmeans(       # of predicted means computed
      object=m_01, # for the model m_01
       ~ SITE),    # namely all contrasts of this predictor
   adjust="none")  # don't adjust for 3 comparisons (careful w/ this)
 contrast estimate     SE   df t.ratio p.value
 a - b     -0.1207 0.0124 7749  -9.731  <.0001
 a - c     -0.1662 0.0127 7749 -13.066  <.0001
 b - c     -0.0455 0.0122 7749  -3.717  0.0002

5.5 Write-up

To determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (a vs. b vs. c), a linear model was fit with RT as the response variable and SITE as a predictor. The model was highly significant (F=112, df1=2, df2=7749, p<0.0001) but explains very little of the response variable (mult. R2=0.028, adj. R2=0.028). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. for SITE: c).

Estimate 95%-CI se t p2-tailed
Intercept (SITE=a) 9.191 [9.173, 9.209] 0.009 1008.81 <0.001
SITEab 0.121 [0.096, 0.145] 0.012 9.73 <0.001
SITEac 0.166 [0.141, 0.191] 0.013 13.07 <0.001

A post-hoc comparison of the above model with one in which sites b and c were conflated indicates that sites b and c are also significantly different from each other (difference between means: 0.046, F=13.817/t=-3.717, punadjusted<0.001).

6 Homework

To prepare for next session,

  • read SFLWR3,
    • Section 5.1 on multifactoriality;
    • Section 5.2.1-5.2.3 on linear regression modeling;
    • Section 5.5 on model selection.

7 Session info

sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] emmeans_1.11.0 effects_4.2-2  car_3.1-3      carData_3.0-5  STGmisc_1.01
[6] Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] Matrix_1.7-3       jsonlite_2.0.0     survey_4.4-2       splines_4.4.3
 [5] boot_1.3-31        yaml_2.3.10        fastmap_1.2.0      TH.data_1.1-3
 [9] coda_0.19-4.1      lattice_0.22-6     Formula_1.2-5      knitr_1.50
[13] rbibutils_2.3      htmlwidgets_1.6.4  MASS_7.3-65        nloptr_2.2.1
[17] insight_1.1.0      nnet_7.3-20        minqa_1.2.8        DBI_1.2.3
[21] rlang_1.1.5        multcomp_1.4-28    xfun_0.51          estimability_1.5.1
[25] cli_3.6.4          Rdpack_2.6.3       digest_0.6.37      grid_4.4.3
[29] xtable_1.8-4       mvtnorm_1.3-3      rstudioapi_0.17.1  sandwich_3.1-1
[33] lme4_1.1-37        nlme_3.1-168       reformulas_0.4.0   evaluate_1.0.3
[37] codetools_0.2-20   zoo_1.8-13         mitools_2.4        survival_3.8-3
[41] abind_1.4-8        colorspace_2.1-1   rmarkdown_2.29     tools_4.4.3
[45] htmltools_0.5.8.1