1 Intro
Let’s load a data set for a few linear modeling examples; the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r.
rm(list=ls(all.names=TRUE))
library(car); library(effects); library(emmeans)
summary(d <- read.delim(    # summarize d, the result of loading
   file="_input/rts.csv",   # this file
   stringsAsFactors=TRUE))  # change categorical variables into factors
CASE RT LENGTH LANGUAGE GROUP
Min. : 1 Min. : 271.0 Min. :3.000 english:4023 english :3961
1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spanish:3977 heritage:4039
Median :4310 Median : 595.0 Median :5.000
Mean :4303 Mean : 661.5 Mean :5.198
3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
Max. :8610 Max. :4130.0 Max. :9.000
NA's :248
CORRECT FREQPMW ZIPFFREQ SITE
correct :7749 Min. : 1.00 Min. :3.000 a:2403
incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
Median : 42.00 Median :4.623 c:2782
Mean : 81.14 Mean :4.591
3rd Qu.: 101.00 3rd Qu.:5.004
Max. :1152.00 Max. :6.061
As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach. But first we change one factor in the data (for purely didactic reasons) and get rid of the missing data:
d$LANGUAGE <- factor(d$LANGUAGE,     # for didactic reasons only, I am
   levels=levels(d$LANGUAGE)[2:1])   # changing the order of the levels
summary(d <- droplevels(d[complete.cases(d),]))
CASE RT LENGTH LANGUAGE GROUP
Min. : 1 Min. : 271.0 Min. :3.0 spanish:3775 english :3793
1st Qu.:2123 1st Qu.: 505.0 1st Qu.:4.0 english:3977 heritage:3959
Median :4262 Median : 595.0 Median :5.0
Mean :4268 Mean : 661.5 Mean :5.2
3rd Qu.:6405 3rd Qu.: 732.0 3rd Qu.:6.0
Max. :8610 Max. :4130.0 Max. :9.0
CORRECT FREQPMW ZIPFFREQ SITE
correct :7749 Min. : 1.00 Min. :3.000 a:2403
incorrect: 3 1st Qu.: 18.00 1st Qu.:4.255 b:2815
Median : 43.00 Median :4.633 c:2534
Mean : 82.88 Mean :4.609
3rd Qu.: 104.00 3rd Qu.:5.017
Max. :1152.00 Max. :6.061
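As a side note, the same reordering could be achieved with relevel, which simply moves a chosen reference level to the front; a minimal sketch (not run here, equivalent to the factor call above):
# equivalent alternative: make "spanish" the reference level directly
d$LANGUAGE <- relevel(d$LANGUAGE, ref="spanish")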
2 Deviance
A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but the response variable’s values are not all the mean, they vary around it:
summary(d$RT)
Min. 1st Qu. Median Mean 3rd Qu. Max.
271.0 505.0 595.0 661.5 732.0 4130.0
One way to express the degree to which the values of RT vary around the mean would be the average absolute deviation:
mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
But as you know, this is not how we usually quantify deviations from the mean: As in the standard deviation, we usually square the differences between the observed values and the mean and then add them up to get the so-called deviance:
sum((d$RT -mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
[1] 536471586
This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:
summary(m_00 <- lm(        # make/summarize the linear model m_00:
   RT ~ 1,                 # RT ~ an overall intercept (1) & no predictor
   data=d,                 # those variables are in d
   na.action=na.exclude))  # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-390.5 -156.5 -66.5 70.5 3468.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 661.469 2.988 221.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 263.1 on 7751 degrees of freedom
That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:
deviance(m_00)
[1] 536471586
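As a quick sanity check (a minimal sketch re-using m_00 and base R only): the deviance is just the sum of the model’s squared residuals, and every fitted value of the null model is the overall mean:
sum(resid(m_00)^2)             # identical to deviance(m_00)
unique(round(fitted(m_00), 4)) # every prediction is the mean: 661.4685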
Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:
t.test(d$RT)
One Sample t-test
data: d$RT
t = 221.37, df = 7751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
655.6111 667.3259
sample estimates:
mean of x
661.4685
And this is how we’d get the confidence intervals that t.test provides from the regression model:
Confint(m_00)
And the manual computation would be this:
# lower bound:
coef(m_00)[1] +                # the intercept plus
   summary(m_00)$coef[1,2] *   # the intercept's se
   qt(0.025, df=7751)          # the t-value for 2.5% at that df
(Intercept)
655.6111
# upper bound:
coef(m_00)[1] +                # the intercept plus
   summary(m_00)$coef[1,2] *   # the intercept's se
   qt(0.975, df=7751)          # the t-value for 2.5% at that df
(Intercept)
667.3259
Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
3 A numeric predictor
Does the reaction time to a word (in ms) vary as a function of the frequency of that word?
3.1 Exploration & preparation
How would we want to explore the response variable?
hist(d$RT); hist(d$RT, breaks="FD")
We can’t really work with that. What if we log this?
hist(log2(d$RT)); hist(log2(d$RT), breaks="FD")
Much better, that’s what we need! We add the logged version to the data:
d$RT_log <-log2(d$RT)
That means we need to compute the deviance again!
summary(m_00 <- lm(        # make/summarize the linear model m_00:
   RT_log ~ 1,             # RT_log ~ an overall intercept (1) & no predictor
   data=d,                 # those variables are in d
   na.action=na.exclude))  # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.20705 -0.30906 -0.07245 0.22650 2.72273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.289201 0.005131 1810 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4518 on 7751 degrees of freedom
deviance(m_00)
[1] 1581.951
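Since RT_log is the binary log of RT, the new intercept back-transforms via 2^x; a minimal sketch showing that this recovers the geometric (not the arithmetic) mean of RT:
2^coef(m_00)[1]      # back-transformed intercept of the logged null model
exp(mean(log(d$RT))) # the geometric mean of RT, computed directly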
Now the predictor:
hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")
plot(main="RT (in ms, logged) as a function of frequency per million words",pch=16, col="#00000010",xlab="Frequency pmw" , xlim=c(0, 1200), x=d$FREQPMW,ylab="RT (in ms, logged)", ylim=c(8, 12 ), y=d$RT_log); grid()abline(lm(d$RT_log ~ d$FREQPMW), col="red", lwd=2)
Again, this doesn’t look good. Thus, we ask, does the reaction time to a word (in ms) vary as a function of a logged version of the frequency of that word? For that we use the already provided variable ZIPFFREQ so let’s explore the new predictor:
# the new predictor(s) on its/their own
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")
# the predictor(s) w/ the response
plot(main="RT (in ms, logged) as a function of Zipf frequency",
   pch=16, col="#00000010",
   xlab="Zipf frequency"    , xlim=c(3, 6), x=d$ZIPFFREQ,
   ylab="RT (in ms, logged)", ylim=c(8, 12), y=d$RT_log); grid()
abline(lm(d$RT_log ~ d$ZIPFFREQ), col="red", lwd=2)
There we go …
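(As an aside, ZIPFFREQ looks like it is simply log10(FREQPMW) + 3, i.e. the log10 frequency per billion words; a minimal check, assuming the data follow that standard Zipf-scale definition:)
head(cbind(ZIPFFREQ=d$ZIPFFREQ, CHECK=log10(d$FREQPMW)+3)) # columns should match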
3.2 Modeling & numerical interpretation
How would we have dealt with this based on Chapter 4 alone? We would have computed a Pearson product-moment correlation coefficient r (or Spearman’s ρ, either with cor.test) and maybe a simple linear model (with lm). Thus, we might have done something like this (for Pearson’s r) …
cor.test(d$RT_log, d$ZIPFFREQ)
Pearson's product-moment correlation
data: d$RT_log and d$ZIPFFREQ
t = -12.939, df = 7750, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1671408 -0.1235589
sample estimates:
cor
-0.1454204
… or this for the simplest possible linear regression model:
summary(m_01 <- lm(        # make/summarize the linear model m_01:
   RT_log ~ 1 +            # RT_log ~ an overall intercept (1)
   ZIPFFREQ,               # & the predictor ZIPFFREQ
   data=d,                 # those variables are in d
   na.action=na.exclude))  # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.21421 -0.30499 -0.07338 0.22546 2.75795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.84212 0.04303 228.72 <2e-16 ***
ZIPFFREQ -0.11995 0.00927 -12.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared: 0.02115, Adjusted R-squared: 0.02102
F-statistic: 167.4 on 1 and 7750 DF, p-value: < 2.2e-16
Note the equivalence of the classical Pearson’s r and the linear model we fit:
the t-statistic from Pearson’s r (-12.9394984) squared is m_01’s F-value (167.4306191);
the t-statistic from Pearson’s r (-12.9394984) is also the t-statistic of ZIPFFREQ (-12.9394984);
the df of Pearson’s r (7750) is also the df2 of m_01 (7750);
the value of Pearson’s r (-0.1454204) squared is m_01’s Multiple R-squared (0.0211471).
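All four equivalences can be checked directly in R; a minimal sketch re-using the objects from above:
cor.test(d$RT_log, d$ZIPFFREQ)$statistic^2 # t squared ...
summary(m_01)$fstatistic[1]                # ... equals m_01's F-value
cor.test(d$RT_log, d$ZIPFFREQ)$estimate^2  # r squared ...
summary(m_01)$r.squared                    # ... equals multiple R-squared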
What is the confidence interval of the coefficients?
Confint(m_01)
And the manual computation would be this:
# lower bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.025, df=7750)          # the t-value for 2.5% at that df
(Intercept)
9.757765
# upper bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.975, df=7750)          # the t-value for 2.5% at that df
(Intercept)
9.926472
# lower bound for the slope:
coef(m_01)[2] +                # the slope plus
   summary(m_01)$coef[2,2] *   # the slope's se
   qt(0.025, df=7750)          # the t-value for 2.5% at that df
ZIPFFREQ
-0.1381246
# upper bound for the slope:
coef(m_01)[2] +                # the slope plus
   summary(m_01)$coef[2,2] *   # the slope's se
   qt(0.975, df=7750)          # the t-value for 2.5% at that df
ZIPFFREQ
-0.1017802
What is the deviance of our model with its one predictor?
deviance(m_01)
[1] 1548.498
Wow, that doesn’t seem to have reduced the deviance much from that of the null model, which was 1581.9514799. How much is the reduction in %?
"/"(deviance(m_00)-deviance(m_01),deviance(m_00))
[1] 0.02114709
Does that number look familiar?
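It should: the proportional reduction of the deviance is exactly the model’s Multiple R-squared, as a one-line check confirms:
summary(m_01)$r.squared # 0.02114709, the same proportion as above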
Let’s now also add the predictions of the model to the original data frame (usually a good (house-keeping) habit to adopt):
d$PRED <-    # make d$PRED the
   predict(  # predictions for all cases
   m_01)     # based on m_01
head(d) # check the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
1 1 748 5 spanish heritage correct 19 4.278754 b 9.546894
2 2 408 5 spanish heritage correct 78 4.892095 b 8.672425
3 3 483 6 spanish heritage correct 233 5.367356 b 8.915879
4 4 1266 4 spanish heritage correct 133 5.123852 b 10.306062
5 5 636 6 spanish heritage correct 14 4.146128 b 9.312883
6 6 395 6 spanish heritage correct 67 4.826075 b 8.625709
PRED
1 9.328872
2 9.255300
3 9.198291
4 9.227500
5 9.344780
6 9.263219
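These predictions are nothing more than intercept + slope × ZIPFFREQ; a minimal check for the first case:
coef(m_01)[1] + coef(m_01)[2]*d$ZIPFFREQ[1] # 9.328872, same as d$PRED[1]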
3.3 Visual interpretation
Let’s visualize the nature of the effect of ZIPFFREQ based on the predictions:
(zpf_d <- data.frame( # make zpf_d a data frame of
   zpf <- effect(     # the effect
   "ZIPFFREQ",        # of ZIPFFREQ
   m_01)))            # in m_01
plot(zpf,                             # plot the effect of ZIPFFREQ
   xlab="Zipf frequency",             # w/ this x-axis label
   ylab="RT (in ms, logged)",         # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE),  # w/ these y-axis limits
   grid=TRUE)                         # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence band of the predictions/regression line:
plot(main="RT (in ms, logged) as a function of Zipf frequency",pch=16, col="#00000010",xlab="Zipf frequency" , xlim=c(3, 6), x=d$ZIPFFREQ,ylab="RT (in ms, logged)", ylim=c(8, 12), y=d$RT_log); grid()axis(4, at=axTicks(2), labels=round(2^axTicks(2), 1))# add the regression line:abline(m_01, col="red", lwd=2)# add the confidence band:polygon(c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),c(zpf_d$lower, rev(zpf_d$upper)),col="#FF000030", border=NA)
3.4 Write-up
We wanted to determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words). However, preliminary exploration indicated that both the response RT and the predictor FREQPMW were extremely skewed with high outliers, which is why
the response variable was transformed with the binary log;
the predictor FREQPMW was replaced by the (inherently log-scaled) predictor ZIPFFREQ.
Then, a linear model was fit with RT_log as the response variable and ZIPFFREQ as a predictor. The model was highly significant (F=167.4, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.021, adj. R2=0.021). The coefficients indicate that ZIPFFREQ is negatively correlated with RT_log, as shown in the summary table and the figure shown here. [Show a plot]
                        Estimate  95%-CI            se     t       p (2-tailed)
Intercept (ZIPFFREQ=0)  9.842     [9.758, 9.926]    0.043  228.72  <0.001
ZIPFFREQ 0→1            -0.12     [-0.138, -0.102]  0.009  -12.94  <0.001
# housekeeping: remove the predictions
str(d <- d[,1:10])
4 A binary predictor
Does the reaction time to a word (in ms) vary as a function of which language that word was presented in (Spanish vs. English)?
4.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT_log); hist(d$RT_log, breaks="FD") # we already did that anyway
table(d$LANGUAGE)
spanish english
3775 3977
# the predictor(s) w/ the response
qwe <- boxplot(main="RT (in ms, logged) as a function of stimulus language",
   d$RT_log ~ d$LANGUAGE, notch=TRUE,
   xlab="Stimulus language",
   ylab="RT (in ms, logged)"); grid()
4.2 Modeling & numerical interpretation
How would we have dealt with this based on Chapter 4 alone? We would have computed a t-test for independent samples (with t.test) like this:
t.test(d$RT_log ~ d$LANGUAGE, var.equal=TRUE) # the default t-test (Student)
Two Sample t-test
data: d$RT_log by d$LANGUAGE
t = 27.155, df = 7750, p-value < 2.2e-16
alternative hypothesis: true difference in means between group spanish and group english is not equal to 0
95 percent confidence interval:
0.2471709 0.2856323
sample estimates:
mean in group spanish mean in group english
9.425872 9.159471
This t-test is actually also just a very simple linear regression model:
summary(m_01 <- lm(        # make/summarize the linear model m_01:
   RT_log ~ 1 +            # RT_log ~ an overall intercept (1)
   LANGUAGE,               # & the predictor LANGUAGE
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.18270 -0.28696 -0.06671 0.21618 2.58605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.425872 0.007027 1341.44 <2e-16 ***
LANGUAGEenglish -0.266402 0.009810 -27.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4317 on 7750 degrees of freedom
Multiple R-squared: 0.08688, Adjusted R-squared: 0.08677
F-statistic: 737.4 on 1 and 7750 DF, p-value: < 2.2e-16
Note the equivalence of the classical t-test and the linear model we fit (with the exception of the direction of the differences):
the t-statistic from the t-test (27.1554785) squared is m_01’s F-value (737.4200137);
the t-statistic from the t-test (27.1554785) is also the (negative of the) t-statistic of LANGUAGEenglish (-27.1554785);
the df of the t-test (7750) is also the df2 of m_01 (7750);
the first mean reported by the t-test (mean in group spanish, of 9.4258724) is the intercept of m_01 (of 9.4258724);
the second mean reported by the t-test (mean in group english, of 9.1594708) is the sum of intercept of m_01 (of 9.4258724) and the mean difference/slope of m_01 (of -0.2664016);
the confidence interval of the t-test (0.2471709, 0.2856323) is the (negative) coefficient for LANGUAGEenglish (of -0.2664016) plus/minus the product of
its standard error (of 0.0098102) and
the t-value that cuts off 2.5% of the area under the curve for df=7750 (qt(p=0.025, df=7750), i.e. -1.9602701).
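These correspondences, too, are easy to verify; a minimal sketch showing that the model merely re-expresses the two group means:
tapply(d$RT_log, d$LANGUAGE, mean) # the two group means from the t-test
coef(m_01)[1]                      # the intercept: the spanish mean
sum(coef(m_01))                    # intercept + slope: the english mean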
This CI computation for the linear model is shown here, first manually again and then the easy way with Confint:
# lower bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.025, df=7750)          # the t-value for 2.5% at that df
(Intercept)
9.412098
# upper bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.975, df=7750)          # the t-value for 2.5% at that df
(Intercept)
9.439647
# lower bound for the difference/slope:
coef(m_01)[2] +                # the difference/slope plus
   summary(m_01)$coef[2,2] *   # the difference/slope's se
   qt(0.025, df=7750)          # the t-value for 2.5% at that df
LANGUAGEenglish
-0.2856323
# upper bound for the difference/slope:
coef(m_01)[2] +                # the difference/slope plus
   summary(m_01)$coef[2,2] *   # the difference/slope's se
   qt(0.975, df=7750)          # the t-value for 2.5% at that df
LANGUAGEenglish
-0.2471709
Confint(m_01)
What is the deviance of our model with its one predictor?
deviance(m_01)
[1] 1444.505
This seems to have reduced the deviance somewhat from that of the null model, which was 1581.9514799 – again, how much is the reduction in %?
"/"(deviance(m_00)-deviance(m_01),deviance(m_00))
[1] 0.08688388
Which I am again hoping is familiar by now …
We again add the predictions of the model to the original data frame:
d$PRED <-    # make d$PRED the
   predict(  # predictions for all cases
   m_01)     # based on m_01
head(d, 3) # check ...
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
1 1 748 5 spanish heritage correct 19 4.278754 b 9.546894
2 2 408 5 spanish heritage correct 78 4.892095 b 8.672425
3 3 483 6 spanish heritage correct 233 5.367356 b 8.915879
PRED
1 9.425872
2 9.425872
3 9.425872
tail(d, 3) # the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
7998 8608 947 5 english english correct 80 4.903090 a 9.887221
7999 8609 660 4 english english correct 746 5.872739 a 9.366322
8000 8610 437 5 english english correct 54 4.732394 c 8.771489
PRED
7998 9.159471
7999 9.159471
8000 9.159471
4.3 Visual interpretation
Let’s visualize the nature of the effect of LANGUAGE based on the predictions:
(lan_d <- data.frame( # make lan_d a data frame of
   lan <- effect(     # the effect
   "LANGUAGE",        # of LANGUAGE
   m_01)))            # in m_01
LANGUAGE fit se lower upper
1 spanish 9.425872 0.007026681 9.412098 9.439647
2 english 9.159471 0.006845906 9.146051 9.172891
plot(lan,                             # plot the effect of LANGUAGE
   xlab="Language",                   # w/ this x-axis label
   ylab="RT (in ms, logged)",         # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE),  # w/ these y-axis limits
   grid=TRUE)                         # & w/ a grid
And here a version with everything: observed values, predicted values, the confidence intervals of the predictions, and a horizontal decile grid:
stripchart(main="RT_log ~ LANGUAGE", # stripchart w/ main headingpch=16, col="#00000010", # filled semi-transparent grey circles d$RT_log ~ d$LANGUAGE, # data: RT_log ~ LANGUAGExlab="Language", ylab="RT in ms", # axis labelsvertical=TRUE, # make the chart have a vertical y-axismethod="jitter") # jitter observations# the next lne creates a horizontal decile gridabline(h=quantile(d$RT_log, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &for (i inseq(lan_d$fit)) { # their confidence intervalsarrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")}
4.4 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (english vs. spanish), a linear model was fit with RT_log as the response variable and LANGUAGE as a predictor. The model was highly significant (F=737.4, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.087, adj. R2=0.087). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot]
                              Estimate  95%-CI            se     t        p (2-tailed)
Intercept (LANGUAGE=spanish)  9.426     [9.412, 9.44]     0.007  1341.44  <0.001
LANGUAGE spanish→english      -0.266    [-0.286, -0.247]  0.01   -27.16   <0.001
# housekeeping: remove the predictions
str(d <- d[,1:10])
5 A categorical predictor
Does the reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (a vs. b vs. c)?
5.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT_log); hist(d$RT_log, breaks="FD") # we already did that anyway
table(d$SITE)
a b c
2403 2815 2534
# the predictor(s) w/ the response
qwe <- boxplot(main="RT (in ms, logged) ~ experimental site",
   d$RT_log ~ d$SITE, notch=TRUE,
   xlab="Experimental site",
   ylab="RT (in ms, logged)"); grid()
It is noteworthy that site c differs considerably from the others in the outlier distribution; this would definitely require some attention! [In fact, we will see something more interesting later.]
5.2 Modeling & numerical interpretation
Let’s fit our regression model:
summary(m_01 <- lm(        # make/summarize the linear model m_01:
   RT_log ~ 1 +            # RT_log ~ an overall intercept (1)
   SITE,                   # & the predictor SITE
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30613 -0.08053 0.23524 2.65471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009111 1008.810 <2e-16 ***
SITEb 0.120700 0.012404 9.731 <2e-16 ***
SITEc 0.166161 0.012717 13.066 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4466 on 7749 degrees of freedom
Multiple R-squared: 0.02295, Adjusted R-squared: 0.0227
F-statistic: 91 on 2 and 7749 DF, p-value: < 2.2e-16
This CI computation for the linear model is shown here, first manually again and then the easy way with Confint:
# lower bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.025, df=7749)          # the t-value for 2.5% at that df
(Intercept)
9.173196
# upper bound for the intercept:
coef(m_01)[1] +                # the intercept plus
   summary(m_01)$coef[1,2] *   # the intercept's se
   qt(0.975, df=7749)          # the t-value for 2.5% at that df
(Intercept)
9.208915
# lower bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +                # the difference/slope plus
   summary(m_01)$coef[2,2] *   # the difference/slope's se
   qt(0.025, df=7749)          # the t-value for 2.5% at that df
SITEb
0.09638472
# upper bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +                # the difference/slope plus
   summary(m_01)$coef[2,2] *   # the difference/slope's se
   qt(0.975, df=7749)          # the t-value for 2.5% at that df
SITEb
0.1450159
# lower bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +                # the difference/slope plus
   summary(m_01)$coef[3,2] *   # the difference/slope's se
   qt(0.025, df=7749)          # the t-value for 2.5% at that df
SITEc
0.1412324
# upper bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +                # the difference/slope plus
   summary(m_01)$coef[3,2] *   # the difference/slope's se
   qt(0.975, df=7749)          # the t-value for 2.5% at that df
SITEc
0.1910900
Confint(m_01)
What is the deviance of our model with its one predictor (but 2 contrasts from the intercept)?
deviance(m_01)
[1] 1545.65
Again pretty bad – how much is the reduction in %?
"/"(deviance(m_00)-deviance(m_01),deviance(m_00))
[1] 0.02294739
As usual, we add the predictions of the model to the original data frame:
# replace this with your answers/code
head(d, 3) # check ...
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
1 1 748 5 spanish heritage correct 19 4.278754 b 9.546894
2 2 408 5 spanish heritage correct 78 4.892095 b 8.672425
3 3 483 6 spanish heritage correct 233 5.367356 b 8.915879
tail(d, 3) # the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
7998 8608 947 5 english english correct 80 4.903090 a 9.887221
7999 8609 660 4 english english correct 746 5.872739 a 9.366322
8000 8610 437 5 english english correct 54 4.732394 c 8.771489
5.3 Visual interpretation
Let’s visualize the nature of the effect of SITE based on the predictions:
(sit_d <- data.frame( # make sit_d a data frame of
   sit <- effect(     # the effect
   "SITE",            # of SITE
   m_01)))            # in m_01
SITE fit se lower upper
1 a 9.191055 0.009110786 9.173196 9.208915
2 b 9.311756 0.008417702 9.295255 9.328256
3 c 9.357216 0.008872161 9.339825 9.374608
plot(sit,                             # plot the effect of SITE
   xlab="Site",                       # w/ this x-axis label
   ylab="RT (in ms, logged)",         # w/ this y-axis label
   ylim=range(d$RT_log, na.rm=TRUE),  # w/ these y-axis limits
   grid=TRUE)                         # & w/ a grid
And here a version with everything: observed values, predicted values, the confidence intervals of the predictions, and a horizontal decile grid:
stripchart(main="RT_log ~ SITE", # stripchart w/ main headingpch=16, col="#00000010", # filled semi-transparent grey circles d$RT_log ~ d$SITE, # data: RT_log ~ SITExlab="RT (in ms, logged)", ylab="Site", # axis labelsvertical=TRUE, # make the chart have a vertical y-axismethod="jitter") # jitter observations# add horizontal grid lines at deciles & indicate the middle 90%:abline(h=quantile(d$RT_log, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &for (i inseq(sit_d$fit)) { # their confidence intervalsarrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")}
Excursus: This is an interesting case because only the very last plot shows something that we haven’t noticed before, namely the fact that site c has a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602], check out the arrow in the plot below); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:
plot(main="RT (in ms, logged) ~ experimental site", type="n",xlab="RT (in ms, logged)" , xlim=c(7.75, 12.25), x=0,ylab="Cumulative proportion", ylim=c(0 , 1 ), y=0); grid()for (i in1:3) { # for each of the 3 siteslines(pch=".", # plot lines w/ this point character# namely ecdf lines using these colorsecdf(d$RT_log[d$SITE==levels(d$SITE)[i]]), col=rainbow(3)[i]) #}legend(title="Site", # add a legend with the title "Site"x=10.5, xjust=0.5, # at these x-coordinates (centered)y=0.5, yjust=0.5, # at these y-coordinates (centered)legend=levels(d$SITE), # these are the labels, show them ...fill=rainbow(3), # w/ these colorsncol=3, bty="n") # organize the legend in 3 columns, no boxarrows(11, 0.7, 10.2, 0.9, angle=15, len=0.2)
5.4 Excursus: model comparison
In cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for SITE very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let’s discuss briefly how this can be tested with a method called model comparison: If you have two models that were fit to the same data set and where one model is a sub-model of the other model – typically such that they differ only by one predictor/term – then you can use the function anova to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of SITE are already in the summary table:
summary(m_01)$coefficients
From this, we know that SITE: a is significantly different from SITE: b (difference=0.1207003, p=2.992632e-22) and SITE: c (difference=0.1661612, p=1.313862e-38), but we don’t know whether the 0.1661612-0.1207003=0.0454609 difference between SITE: b and SITE: c is significant as well. To find that out, we can compare m_01 to a model in which SITE: b and SITE: c are not distinguished. For that, we first create a version of SITE that conflates levels as desired:
d$SITE_confl <- d$SITE  # create a copy of SITE
levels(d$SITE_confl) <- # overwrite the copy's levels such that
   c("a",               # "a" remains "a"
   "b/c", "b/c")        # "b" & "c" become "b/c"
table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it right
SITE_confl
SITE a b/c
a 2403 0
b 0 2815
c 0 2534
head(d)
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE RT_log
1 1 748 5 spanish heritage correct 19 4.278754 b 9.546894
2 2 408 5 spanish heritage correct 78 4.892095 b 8.672425
3 3 483 6 spanish heritage correct 233 5.367356 b 8.915879
4 4 1266 4 spanish heritage correct 133 5.123852 b 10.306062
5 5 636 6 spanish heritage correct 14 4.146128 b 9.312883
6 6 395 6 spanish heritage correct 67 4.826075 b 8.625709
SITE_confl
1 b/c
2 b/c
3 b/c
4 b/c
5 b/c
6 b/c
We already have the model with SITE as the only predictor (m_01) so we only need to fit a new model m_01_siteconfl with SITE_confl as the only predictor:
summary(m_01_siteconfl <- lm(     # make/summarize the linear model m_01_siteconfl:
   RT_log ~ 1 + SITE_confl,       # RT_log ~ an overall intercept (1) & SITE_confl
   data=d, na.action=na.exclude)) # the other arguments
Call:
lm(formula = RT_log ~ 1 + SITE_confl, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30841 -0.08063 0.23555 2.67863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009118 1007.98 <2e-16 ***
SITE_conflb/c 0.142237 0.010977 12.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared: 0.02121, Adjusted R-squared: 0.02108
F-statistic: 167.9 on 1 and 7750 DF, p-value: < 2.2e-16
… and then we do a comparison with anova:
anova(m_01,        # compare the model w/ the 3-level predictor SITE
   m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confl
   test="F")       # using an F-test
Analysis of Variance Table
Model 1: RT_log ~ 1 + SITE
Model 2: RT_log ~ 1 + SITE_confl
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7749 1545.7
2 7750 1548.4 -1 -2.7561 13.817 0.0002029 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the two models is highly significant: Giving up the distinction between SITE: b and SITE: c makes the model not just worse, but significantly worse, meaning the finer resolution of SITE in m_01 compared to SITE_confl in m_01_siteconfl makes that model m_01 significantly better (p<0.001). Thus, we now know not only the size of the 3rd comparison (SITE: b vs. c, which was clear from summary(m_01)), but also its p-value.
Another way of getting such results is available from emmeans::emmeans. Here,
lines 2 and 3 of the model’s summary table correspond to lines 1 and 2 of the pairwise-comparisons (pairs) output below (the contrasts a - b and a - c);
line 3 of that pairwise-comparisons output (the contrast b - c) corresponds to the result we obtained with the model conflation approach above: the F-value of the model conflation approach is the squared t-value of the emmeans approach:
summary(m_01)$coefficients # repeating what the coefs were
emmeans(         # the predicted means computed
   object=m_01,  # for the model m_01
   ~ SITE)       # namely all contrasts of this predictor
SITE emmean SE df lower.CL upper.CL
a 9.191 0.00911 7749 9.173 9.209
b 9.312 0.00842 7749 9.295 9.328
c 9.357 0.00887 7749 9.340 9.375
Confidence level used: 0.95
pairs(             # compute pairwise comparisons
   emmeans(        # of predicted means computed
      object=m_01, # for the model m_01
      ~ SITE),     # namely all contrasts of this predictor
   adjust="none")  # don't adjust for 3 comparisons (careful w/ this)
contrast estimate SE df t.ratio p.value
a - b -0.1207 0.0124 7749 -9.731 <.0001
a - c -0.1662 0.0127 7749 -13.066 <.0001
b - c -0.0455 0.0122 7749 -3.717 0.0002
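Note how this ties the two approaches together: squaring the t-ratio of the b - c contrast recovers (up to rounding) the F-value of the anova-based model comparison above:
(-3.717)^2 # approx. 13.82, cf. the F-value of 13.817 from anova above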
5.5 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (a vs. b vs. c), a linear model was fit with RT_log as the response variable and SITE as a predictor. The model was highly significant (F=91, df1=2, df2=7749, p<0.0001) but explains very little of the response variable (mult. R2=0.023, adj. R2=0.023). The coefficients indicate that reaction times are shortest at site a and longer at sites b and c, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. for SITE: c).
                    Estimate  95%-CI          se     t        p (2-tailed)
Intercept (SITE=a)  9.191     [9.173, 9.209]  0.009  1008.81  <0.001
SITE a→b            0.121     [0.096, 0.145]  0.012  9.73     <0.001
SITE a→c            0.166     [0.141, 0.191]  0.013  13.07    <0.001
A post-hoc comparison of the above model with one in which sites b and c were conflated indicates that sites b and c are also significantly different from each other (difference between means: 0.046, F=13.817/t=-3.717, punadjusted<0.001).
6 Homework
To prepare for next session,
read SFLWR3,
Section 5.1 on multifactoriality;
Section 5.2.1-5.2.3 on linear regression modeling;
Section 5.5 on model selection.
7 Session info
sessionInfo()