Let’s load a data set for a few linear modeling examples; the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r.
rm(list=ls(all.names=TRUE))
library(car); library(effects); library(magrittr)
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
CASE RT LENGTH LANGUAGE GROUP
Min. : 1 Min. : 271.0 Min. :3.000 english:4023 english :3961
1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spanish:3977 heritage:4039
Median :4310 Median : 595.0 Median :5.000
Mean :4303 Mean : 661.5 Mean :5.198
3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
Max. :8610 Max. :4130.0 Max. :9.000
NA's :248
CORRECT FREQPMW ZIPFFREQ SITE
correct :7749 Min. : 1.00 Min. :3.000 a:2403
incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
Median : 42.00 Median :4.623 c:2782
Mean : 81.14 Mean :4.591
3rd Qu.: 101.00 3rd Qu.:5.004
Max. :1152.00 Max. :6.061
As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
2 Deviance
A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but its values are not all equal to that mean – they vary around it:
summary(d$RT)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
271.0 505.0 595.0 661.5 732.0 4130.0 248
One way to express the degree to which the values of RT vary around the mean would be the average absolute deviation:
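The code for this computation was not preserved here; it was presumably something along these lines (a plausible reconstruction, not the original call):

```r
# average absolute deviation: mean of the absolute differences
# between the observed RTs and their mean
mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
```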
But as you know, this is not how we usually quantify deviations from the mean: As in the standard deviation, we usually square the differences between the observed values and the mean and then add them up to get the so-called deviance:
sum((d$RT -mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
[1] 536471586
This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:
summary(m_00 <- lm(       # make/summarize the linear model m_00:
   RT ~ 1,                # RT ~ an overall intercept (1) & no predictor
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-390.5 -156.5 -66.5 70.5 3468.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 661.469 2.988 221.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 263.1 on 7751 degrees of freedom
(248 observations deleted due to missingness)
That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:
deviance(m_00)
[1] 536471586
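This identity – a null model's deviance is the sum of squared deviations from the mean – can also be verified with a tiny self-contained example:

```r
# a minimal check with made-up numbers:
x  <- c(2, 4, 6, 8)       # mean(x) is 5
m0 <- lm(x ~ 1)           # null model: intercept only
deviance(m0)              # residual sum of squares of m0
sum((x - mean(x))^2)      # 9 + 1 + 1 + 9 = 20, the same value
```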
Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:
t.test(d$RT)
One Sample t-test
data: d$RT
t = 221.37, df = 7751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
655.6111 667.3259
sample estimates:
mean of x
661.4685
And this is how we’d get the confidence intervals that t.test provides from the regression model:
# lower bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.025, df=7751)        # the t-value for 2.5% at that df
(Intercept)
655.6111
# upper bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.975, df=7751)        # the t-value for 97.5% at that df
(Intercept)
667.3259
Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
3 A numeric predictor
Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?
3.1 Exploration & preparation
Some exploration of the response variable:
hist(d$RT); hist(d$RT, breaks="FD")
Very long tail on the right … Let’s see where ‘the middle 90%’ of the data are, which might be useful as a good plotting range:
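The call that produced those limits is not shown in the original; presumably it was something like this:

```r
# where are 'the middle 90%' of the RT values?
quantile(d$RT, probs=c(0.05, 0.95), na.rm=TRUE)
# per the text below: 5% = 417, 95% = 1137.45
```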
Ok, so the middle 90% of the data are in [417, 1137.45], which means that some values like that might make for good y-axis limits (e.g. for some plots). Now the predictor:
hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")
# the predictor(s) w/ the response
plot(main="RT in ms as a function of frequency per million words",
   pch=16, col="#00000030",
   xlab="Frequency pmw", xlim=c(  0, 1200), x=d$FREQPMW,
   ylab="RT in ms"     , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)
3.2 Modeling & numerical interpretation
How would we have dealt with this in Ling 104? We would have computed a Pearson product-moment correlation coefficient r (or Spearman’s ρ, both available via cor.test) and maybe a simple linear model (with lm). Thus, we might have done something like this (for Pearson’s r) …
cor.test(d$RT, d$FREQPMW)
Pearson's product-moment correlation
data: d$RT and d$FREQPMW
t = -5.8629, df = 7750, p-value = 4.734e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08858155 -0.04425515
sample estimates:
cor
-0.06645113
… or this for the simplest possible linear regression model:
summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   FREQPMW,               # & the predictor FREQPMW
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-396.3 -156.1 -68.9 72.4 3468.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892 3.65267 184.479 < 2e-16 ***
FREQPMW -0.14926 0.02546 -5.863 4.73e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 262.5 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.004416, Adjusted R-squared: 0.004287
F-statistic: 34.37 on 1 and 7750 DF, p-value: 4.734e-09
Note the equivalence of the classical Pearson’s r and the linear model we fit:
the t-statistic from Pearson’s r (-5.8629236) squared is m_01’s F-value (34.3738736);
the t-statistic from Pearson’s r (-5.8629236) is also the t-statistic of FREQPMW (-5.8629236);
the df of Pearson’s r (7750) is also the df2 of m_01 (7750);
the value of Pearson’s r (-0.0664511) squared is m_01’s Multiple R-squared (0.0044158).
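These equivalences hold for any simple regression, not just this data set; here is a self-contained check with R’s built-in cars data:

```r
# Pearson's r vs. lm on the built-in cars data:
ct <- cor.test(cars$speed, cars$dist)
m  <- summary(lm(dist ~ speed, data=cars))
all.equal(unname(ct$statistic)^2,         # t from cor.test, squared, is
          unname(m$fstatistic["value"]))  # the model's F-value
all.equal(unname(ct$estimate)^2,          # r squared is
          m$r.squared)                    # the model's Multiple R-squared
```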
But that correlation/model can’t be ‘right’, given the descriptive/exploratory plots. Thus, we ask, does the reaction time to a word (in ms) vary as a function of a logged version of the frequency of that word? For that we use the already provided variable ZIPFFREQ so let’s explore the new predictor:
# the new predictor(s) on its/their own
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")
# the predictor(s) w/ the response
plot(main="RT in ms as a function of Zipf frequency",
   pch=16, col="#00000030",
   xlab="Zipf frequency", xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms"      , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$ZIPFFREQ), col="red", lwd=2)
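An aside on what ZIPFFREQ presumably is: the Zipf frequency scale of van Heuven et al. (2014), i.e. the base-10 log of the frequency per million words, plus 3. That assumption is easy to check against the data – the first case of d has FREQPMW=19 and ZIPFFREQ=4.278754:

```r
# Zipf frequency = log10(frequency per million words) + 3
log10(19) + 3 # = 4.278754, matching d$ZIPFFREQ for the first case of d
```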
Let’s fit our new/second regression model:
summary(m_02 <- lm(       # make/summarize the linear model m_02:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   ZIPFFREQ,              # & the predictor ZIPFFREQ (which involves logging)
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + ZIPFFREQ, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-411.4 -153.1 -64.6 71.5 3486.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 949.36 25.11 37.80 <2e-16 ***
ZIPFFREQ -62.46 5.41 -11.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 260.9 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.01691, Adjusted R-squared: 0.01678
F-statistic: 133.3 on 1 and 7750 DF, p-value: < 2.2e-16
What are the confidence intervals of the coefficients?
# lower bound for the intercept:
coef(m_02)[1] +              # the intercept plus
   summary(m_02)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
900.1318
# upper bound for the intercept:
coef(m_02)[1] +              # the intercept plus
   summary(m_02)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 97.5% at that df
(Intercept)
998.5892
# lower bound for the slope:
coef(m_02)[2] +              # the slope plus
   summary(m_02)$coef[2,2] * # the slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
ZIPFFREQ
-73.06186
# upper bound for the slope:
coef(m_02)[2] +              # the slope plus
   summary(m_02)$coef[2,2] * # the slope's se
   qt(0.975, df=7750)        # the t-value for 97.5% at that df
ZIPFFREQ
-51.85124
What is the deviance of our model with its one predictor?
deviance(m_02)
[1] 527402124
Wow, that doesn’t seem to have reduced the deviance much from that of the null model, which was 536,471,586. How much is the reduction in %?
"/"(deviance(m_00)-deviance(m_02), # the reduction in deviance
    deviance(m_00))                # divided by the null model's deviance
[1] 0.01690576
Does that number look familiar?
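It is m_02’s Multiple R-squared, and that is no coincidence: the proportional reduction in deviance relative to the null model is exactly Multiple R-squared. A self-contained check, again with the built-in cars data:

```r
m0 <- lm(dist ~ 1    , data=cars) # null model
m1 <- lm(dist ~ speed, data=cars) # model w/ one predictor
all.equal((deviance(m0) - deviance(m1)) / deviance(m0), # deviance reduction
          summary(m1)$r.squared)                        # == Multiple R-squared
```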
Let’s now also add the predictions of the model to the original data frame (usually a good (house-keeping) habit to adopt):
d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d)     # check the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
1 1 748 5 spanish heritage correct 19 4.278754 b 671.0029
2 2 408 5 spanish heritage correct 78 4.892095 b 662.1964
3 3 483 6 spanish heritage correct 233 5.367356 b 639.0605
4 4 1266 4 spanish heritage correct 133 5.123852 b 653.9869
5 5 636 6 spanish heritage correct 14 4.146128 b 671.7492
6 6 395 6 spanish heritage correct 67 4.826075 b 663.8383
3.3 Visual interpretation
Let’s visualize the nature of the effect of ZIPFFREQ based on the predictions from effects:
(zpf_d <- data.frame( # make zpf_d a data frame of
   zpf <- effect(     # the effect
   "ZIPFFREQ",        # of ZIPFFREQ
   m_02)))            # in m_02
plot(zpf,                        # plot the effect of ZIPFFREQ in m_02
   xlab="Zipf frequency",        # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid
plot(zpf,                        # plot the effect of ZIPFFREQ
   xlab="Zipf frequency",        # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),            # w/ these y-axis limits (middle 90%)
   grid=TRUE)                    # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence band of the predictions/regression line:
par(mfrow=c(1,2))
plot(main="RT in ms as a function of Zipf frequency",
   pch=16, col="#00000030",
   xlab="Zipf frequency", xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms"      , ylim=c(250, 4250), y=d$RT); grid()
# add the regression line:
abline(m_02, col="red", lwd=2)
# add the confidence band:
polygon(c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),
   c(zpf_d$lower, rev(zpf_d$upper)),
   col="#FF000030", border=NA)
# indicate the middle 90%:
abline(h=c(417, 1138), lty=3, col="blue")
plot(main="RT in ms as a function of Zipf frequency",
   frame.plot=FALSE, pch=16, col="#00000030",
   xlab="Zipf frequency"       , xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms (middle 90%)", ylim=c(417, 1138), y=d$RT)
grid(); box(col="blue", lty=3)
# add the regression line:
abline(m_02, col="red", lwd=2)
# add the confidence band:
polygon(col="#FF000030", border=NA,
   c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),
   c(zpf_d$lower, rev(zpf_d$upper)))
par(mfrow=c(1,1))
3.4 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words), a linear model was fit with RT as the response variable and FREQPMW as a predictor. Exploration of all variables involved indicated the skewed distribution of FREQPMW, which is why a log-transformed version of this predictor was used in its place, ZIPFFREQ. The model was highly significant (F=133.3, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.017, adj. R2=0.017). The coefficients indicate that ZIPFFREQ is negatively correlated with RT, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.
                         Estimate   95%-CI              se      t       p (2-tailed)
Intercept (ZIPFFREQ=0)   949.36     [900.13, 998.59]    25.11   37.8    <0.001
ZIPFFREQ 0→1             -62.46     [-73.06, -51.85]    5.41    -11.54  <0.001
# housekeeping: remove the predictions
str(d <- d[,1:9])
4 A binary predictor
Does the reaction time to a word (in ms) vary as a function of which language that word was presented in (English vs. Spanish)?
4.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")
table(d$LANGUAGE)
english spanish
4023 3977
# the predictor(s) w/ the response
qwe <- boxplot(main="RT in ms as a function of stimulus language",
   d$RT ~ d$LANGUAGE, notch=TRUE,
   xlab="Stimulus language",
   ylab="RT in ms", ylim=c(250, 4250)); grid()
There are many outliers, let’s plot this again without them:
# the predictor(s) w/ the response
boxplot(main="RT in ms as a function of stimulus language",
   d$RT ~ d$LANGUAGE, notch=TRUE, outline=FALSE,
   xlab="Stimulus language",
   ylab="RT in ms (no 'outliers')"); grid()
4.2 Modeling & numerical interpretation
How would we have dealt with this in Ling 104? We would have computed an independent two-sample t-test (with t.test). Thus, we might have done something like this …
t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE) # the default t-test (Student)
Two Sample t-test
data: d$RT by d$LANGUAGE
t = -23.661, df = 7750, p-value < 2.2e-16
alternative hypothesis: true difference in means between group english and group spanish is not equal to 0
95 percent confidence interval:
-147.9268 -125.2911
sample estimates:
mean in group english mean in group spanish
594.9439 731.5528
This t-test is actually also just a very simple linear regression model:
summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   LANGUAGE,              # & the predictor LANGUAGE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + LANGUAGE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-428.6 -146.6 -58.9 69.1 3398.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 594.944 4.029 147.66 <2e-16 ***
LANGUAGEspanish 136.609 5.774 23.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 254.1 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.06737, Adjusted R-squared: 0.06725
F-statistic: 559.8 on 1 and 7750 DF, p-value: < 2.2e-16
Note the equivalence of the classical t-test and the linear model we fit (with the exception of the direction of the differences):
the t-statistic from the t-test (-23.6609065) squared is m_01’s F-value (559.838495);
the t-statistic from the t-test (-23.6609065) is also the (negative of the) t-statistic of LANGUAGEspa (23.6609065);
the df of the t-test (7750) is also the df2 of m_01 (7750);
the first mean reported by the t-test (mean in group eng, of 594.9439276) is the intercept of m_01 (of 594.9439276);
the second mean reported by the t-test (mean in group spa, of 731.5528477) is the sum of
the intercept of m_01 (of 594.9439276) and
the mean difference/slope of m_01 (of 136.6089201);
the confidence interval of the t-test (-147.9267615, -125.2910787) is the (negative) coefficient for LANGUAGEspa (of 136.6089201) plus/minus the product of
its standard error (of 5.7736131) and
the t-value that cuts off 2.5% of the area under the curve for df=7750 (qt(p=0.025, df=7750), i.e. -1.9602701).
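Here, too, the equivalence is general; a self-contained check with R’s built-in sleep data:

```r
# two-sample t-test vs. lm on the built-in sleep data:
tt <- t.test(extra ~ group, data=sleep, var.equal=TRUE)
m  <- lm(extra ~ group, data=sleep)
all.equal(unname(tt$statistic),            # the t-test's t-statistic is
          -coef(summary(m))[2, "t value"]) # minus the slope's t-statistic
```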
The CI computation for the linear model is shown here, first manually again and then the easy way with Confint:
# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
587.046
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 97.5% at that df
(Intercept)
602.8419
# lower bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
LANGUAGEspanish
125.2911
# upper bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7750)        # the t-value for 97.5% at that df
LANGUAGEspanish
       147.9268
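And the ‘easy way with Confint’ mentioned above: car::Confint (car is loaded) returns the estimates plus both bounds in one go:

```r
Confint(m_01) # estimates w/ their 2.5% & 97.5% bounds, from car
```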
Back to the model: What is the deviance of our model with its one predictor?
deviance(m_01)
[1] 500329194
This seems to have reduced the deviance rather more from that of the null model, which was 536,471,586 – again, how much is the reduction in %?
"/"(deviance(m_00)-deviance(m_01), # the reduction in deviance
    deviance(m_00))                # divided by the null model's deviance
[1] 0.06737056
Which I am again hoping is familiar by now …
We again add the predictions of the model to the original data frame:
d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d, 3)  # check ...
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
1 1 748 5 spanish heritage correct 19 4.278754 b 731.5528
2 2 408 5 spanish heritage correct 78 4.892095 b 731.5528
3 3 483 6 spanish heritage correct 233 5.367356 b 731.5528
tail(d, 3) # the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
7998 8608 947 5 english english correct 80 4.903090 a 594.9439
7999 8609 660 4 english english correct 746 5.872739 a 594.9439
8000 8610 437 5 english english correct 54 4.732394 c 594.9439
4.3 Visual interpretation
Let’s visualize the nature of the effect of LANGUAGE based on the predictions from effects:
(lan_d <- data.frame( # make lan_d a data frame of
   lan <- effect(     # the effect
   "LANGUAGE",        # of LANGUAGE
   m_01)))            # in m_01
LANGUAGE fit se lower upper
1 english 594.9439 4.029019 587.0460 602.8419
2 spanish 731.5528 4.135410 723.4463 739.6594
plot(lan,                        # plot the effect of LANGUAGE in m_01
   xlab="Language",              # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid
plot(lan,                        # plot the effect of LANGUAGE
   xlab="Language",              # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),            # w/ these y-axis limits (middle 90%)
   grid=TRUE)                    # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:
par(mfrow=c(1,2))
stripchart(main="RT ~ LANGUAGE",     # stripchart w/ main heading
   pch=16, col="#00000020",          # filled semi-transparent grey circles
   d$RT ~ d$LANGUAGE,                # data: RT ~ LANGUAGE
   xlab="Language", ylab="RT in ms", # axis labels
   vertical=TRUE,                    # make the chart have a vertical y-axis
   method="jitter")                  # jitter observations
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
abline(h=c(417, 1138), lty=3, col="blue")
points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(lan_d$fit)) {                           # their confidence intervals
   arrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")
}
stripchart(main="RT ~ LANGUAGE",                # stripchart w/ main heading
   pch=16, col="#00000020",                     # filled semi-transparent grey circles
   d$RT ~ d$LANGUAGE,                           # data: RT ~ LANGUAGE
   xlab="Language", ylab="RT in ms (middle 90%)", # axis labels
   ylim=c(417, 1138),                           # y-axis limits: middle 90%
   vertical=TRUE,                               # make the chart have a vertical y-axis
   method="jitter",                             # jitter observations
   frame.plot=FALSE)                            # suppress box
box(col="blue", lty=3)
# add horizontal grid lines at deciles:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(lan_d$fit)) {                           # their confidence intervals
   arrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")
}
par(mfrow=c(1,1))
4.4 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (english vs. spanish), a linear model was fit with RT as the response variable and LANGUAGE as a predictor. The model was highly significant (F=559.8, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.067, adj. R2=0.067). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.
                              Estimate   95%-CI              se     t       p (2-tailed)
Intercept (LANGUAGE=english)  594.94     [587.05, 602.84]    4.03   147.66  <0.001
LANGUAGE english→spanish      136.61     [125.29, 147.93]    5.77   23.66   <0.001
# housekeeping: remove the predictions
str(d <- d[,1:9])
5 A categorical predictor
Does the reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (a vs. b vs. c)?
5.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")
table(d$SITE)
a b c
2403 2815 2782
# the predictor(s) w/ the response
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
qwe <- boxplot(main="RT in ms ~ experimental site",
   d$RT ~ d$SITE, notch=TRUE,
   xlab="Experimental site",
   ylab="RT in ms", ylim=c(250, 4250)); grid()
boxplot(main="RT in ms ~ experimental site",
   d$RT ~ d$SITE, notch=TRUE, outline=FALSE,
   xlab="Experimental site",
   ylab="RT in ms (no 'outliers')"); grid()
par(mfrow=c(1, 1)) # reset to default
It is noteworthy that site c differs considerably from the others in the outlier distribution; this would definitely require some attention!
5.2 Modeling & numerical interpretation
Let’s fit our regression model:
summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   SITE,                  # & the predictor SITE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-346.9 -158.5 -64.8 76.2 3416.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 603.449 5.292 114.040 <2e-16 ***
SITEb 60.334 7.204 8.375 <2e-16 ***
SITEc 110.467 7.386 14.956 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 259.4 on 7749 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.0281, Adjusted R-squared: 0.02785
F-statistic: 112 on 2 and 7749 DF, p-value: < 2.2e-16
The CI computation for the linear model is shown here, first manually again and then the easy way with Confint:
# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
(Intercept)
593.0765
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7749)        # the t-value for 97.5% at that df
(Intercept)
613.8223
# lower bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
SITEb
46.21166
# upper bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 97.5% at that df
SITEb
74.45678
# lower bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
SITEc
95.98823
# upper bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 97.5% at that df
   SITEc
124.9456
What is the deviance of our model with its one predictor (but 2 contrasts from the intercept)?
deviance(m_01)
[1] 521397026
Again pretty bad – how much is the reduction in % relative to the null model?
"/"(deviance(m_00)-deviance(m_01), # the reduction in deviance
    deviance(m_00))                # divided by the null model's deviance
[1] 0.02809946
As usual, we add the predictions of the model to the original data frame:
d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d, 3)  # check ...
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
1 1 748 5 spanish heritage correct 19 4.278754 b 663.7837
2 2 408 5 spanish heritage correct 78 4.892095 b 663.7837
3 3 483 6 spanish heritage correct 233 5.367356 b 663.7837
tail(d, 3) # the result
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
7998 8608 947 5 english english correct 80 4.903090 a 603.4494
7999 8609 660 4 english english correct 746 5.872739 a 603.4494
8000 8610 437 5 english english correct 54 4.732394 c 713.9163
5.3 Visual interpretation
Let’s visualize the nature of the effect of SITE based on the predictions from effects:
(sit_d <- data.frame( # make sit_d a data frame of
   sit <- effect(     # the effect
   "SITE",            # of SITE
   m_01)))            # in m_01
SITE fit se lower upper
1 a 603.4494 5.291570 593.0765 613.8223
2 b 663.7837 4.889025 654.1998 673.3675
3 c 713.9163 5.152976 703.8151 724.0176
plot(sit,                        # plot the effect of SITE in m_01
   xlab="Site",                  # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid
plot(sit,                        # plot the effect of SITE
   xlab="Site",                  # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),            # w/ these y-axis limits (middle 90%)
   grid=TRUE)                    # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:
par(mfrow=c(1,2))
stripchart(main="RT ~ SITE",     # stripchart w/ main heading
   pch=16, col="#00000020",      # filled semi-transparent grey circles
   d$RT ~ d$SITE,                # data: RT ~ SITE
   xlab="Site", ylab="RT in ms", # axis labels
   vertical=TRUE,                # make the chart have a vertical y-axis
   method="jitter")              # jitter observations
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
abline(h=c(417, 1138), lty=3, col="blue")
points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(sit_d$fit)) {                           # their confidence intervals
   arrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")
}
stripchart(main="RT ~ SITE",                  # stripchart w/ main heading
   pch=16, col="#00000020",                   # filled semi-transparent grey circles
   d$RT ~ d$SITE,                             # data: RT ~ SITE
   xlab="Site", ylab="RT in ms (middle 90%)", # axis labels
   ylim=c(417, 1138),                         # y-axis limits: middle 90%
   vertical=TRUE,                             # make the chart have a vertical y-axis
   method="jitter",                           # jitter observations
   frame.plot=FALSE)                          # suppress box
box(col="blue", lty=3)
# add horizontal grid lines at deciles:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(sit_d$fit)) {                           # their confidence intervals
   arrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")
}
par(mfrow=c(1,1))
Excursus: This is an interesting case because only the very last plot shows something that we haven’t noticed before, namely the fact that both sites a and c have a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602]); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:
plot(main="RT in ms ~ experimental site", type="n",
   xlab="RT in ms"             , xlim=c(7.75, 12.25), x=0,
   ylab="Cumulative proportion", ylim=c(0   ,  1   ), y=0); grid()
for (i in 1:3) {  # for each of the 3 sites
   lines(pch=".", # plot lines w/ this point character
      ecdf(log2(d$RT[d$SITE==levels(d$SITE)[i]])), # namely ecdf lines
      col=rainbow(3)[i])                           # using these colors
}
legend(title="Site",      # add a legend with the title "Site"
   x=10.5, xjust=0.5,     # at these x-coordinates (centered)
   y=0.5 , yjust=0.5,     # at these y-coordinates (centered)
   legend=levels(d$SITE), # these are the labels, show them ...
   fill=rainbow(3),       # w/ these colors
   ncol=3, bty="n")       # organize the legend in 3 columns, no box
5.4 Excursus: model comparison
In cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for SITE very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let’s discuss briefly how this can be tested with a method called model comparison: If you have two models that were fit to the same data set and where one model is a sub-model of the other model – typically such that they differ only by one predictor/term – then you can use the function anova to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of SITE are already in the summary table:
From this, we know that SITE: a is significantly different from SITE: b and SITE: c, but we don’t know whether the 110.46690-60.33422=50.13268 difference between SITE: b and SITE: c is significant as well. To find that out, we can compare m_01 to a model in which SITE: b and SITE: c are not distinguished. For that, we first create a version of SITE that conflates the levels as desired:
d$SITE_confl <- d$SITE  # create a copy of SITE
levels(d$SITE_confl) <- # overwrite the copy's levels such that
   c("a",               # "a" remains "a"
     "b/c", "b/c")      # "b" & "c" become "b/c"
table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it right
SITE_confl
SITE a b/c
a 2403 0
b 0 2815
c 0 2782
head(d)
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE PRED
1 1 748 5 spanish heritage correct 19 4.278754 b 663.7837
2 2 408 5 spanish heritage correct 78 4.892095 b 663.7837
3 3 483 6 spanish heritage correct 233 5.367356 b 663.7837
4 4 1266 4 spanish heritage correct 133 5.123852 b 663.7837
5 5 636 6 spanish heritage correct 14 4.146128 b 663.7837
6 6 395 6 spanish heritage correct 67 4.826075 b 663.7837
SITE_confl
1 b/c
2 b/c
3 b/c
4 b/c
5 b/c
6 b/c
We already have the model with SITE as the only predictor so we only need to fit a new model with SITE_confl as the only predictor:
summary(m_01_siteconfl <- lm(    # make/summarize the linear model m_01_siteconfl:
   RT ~ 1 + SITE_confl,          # RT ~ an overall intercept (1) & SITE_confl
   data=d, na.action=na.exclude)) # the other arguments as before
Call:
lm(formula = RT ~ 1 + SITE_confl, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-332.4 -155.5 -67.4 77.5 3442.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 603.449 5.308 113.68 <2e-16 ***
SITE_conflb/c 84.084 6.390 13.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 260.2 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.02185, Adjusted R-squared: 0.02173
F-statistic: 173.1 on 1 and 7750 DF, p-value: < 2.2e-16
… and then we do a comparison with anova:
anova(m_01,        # compare the model w/ the 3-level predictor SITE
   m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confl
   test="F")       # using an F-test
Analysis of Variance Table
Model 1: RT ~ 1 + SITE
Model 2: RT ~ 1 + SITE_confl
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7749 521397026
2 7750 524748642 -1 -3351616 49.812 1.839e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the two models is hugely significant: The finer resolution of SITE compared to SITE_confl makes the model significantly better (p<0.00001). Thus, we now know not only the size of the 3rd comparison (SITE: b vs. c, which was clear from summary(m_01)), but also its p-value.
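To make the logic of this F-test concrete, the F-value in the anova table can be recomputed by hand from the two residual sums of squares it reports; this is a minimal sketch that plugs in the RSS and df values printed above rather than extracting them from the model objects:

```r
rss_full <- 521397026   # RSS of m_01 (3-level SITE), from the anova table
rss_red  <- 524748642   # RSS of m_01_siteconfl (2-level SITE_confl)
df_diff  <- 1           # the models differ by 1 parameter
df_full  <- 7749        # residual df of the fuller model
# F = (increase in RSS per df lost) / (residual variance of the fuller model):
f_val <- ((rss_red - rss_full) / df_diff) / (rss_full / df_full)
round(f_val, 3)   # matches the F-value reported by anova (~49.81)
```

The same ratio generalizes to comparisons that differ by more than one parameter; only df_diff changes.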
Another way of getting such results is available from emmeans::emmeans, as you saw in the book. Here,
- lines 2 and 3 of the model’s summary table correspond to lines 1 and 2 of the emmeans output;
- line 3 of the emmeans output corresponds to the result we obtained with the model conflation approach above: the F-value of the model conflation approach is the squared t-value of the emmeans approach:
pairs(                  # compute pairwise comparisons
   emmeans::emmeans(    # of means computed
      object=m_01,      # for the model m_01
      ~ SITE),          # namely all contrasts of this predictor
   adjust="none")       # don't adjust for 3 comparisons (careful w/ this)
contrast estimate SE df t.ratio p.value
a - b -60.3 7.20 7749 -8.375 <.0001
a - c -110.5 7.39 7749 -14.956 <.0001
b - c -50.1 7.10 7749 -7.058 <.0001
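As a quick numerical check of the correspondence just described, squaring the t-ratio of the b - c contrast from the emmeans output recovers (up to rounding) the F-value of the model comparison:

```r
t_bc <- -7.058   # t-ratio of the b - c contrast (emmeans output above)
t_bc^2           # ~49.81, i.e. the F-value from the anova model comparison
```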
Finally, a very flexible, powerful, and quite complicated way of getting all sorts of such results is also available from multcomp::glht. (I only show this here as an appetizer; a discussion of the cool things glht can do is offered in other courses):
c(0, 1, -1) %>%      # the contrast weights for b vs. c
   matrix(nrow=1) %>% # as a 1-row contrast matrix
   multcomp::glht(m_01, .) %>% # test that hypothesis on m_01
   summary
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = RT ~ 1 + SITE, data = d, na.action = na.exclude)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 == 0 -50.133 7.103 -7.058 1.84e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
5.5 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (a vs. b vs. c), a linear model was fit with RT as the response variable and SITE as a predictor. The model was highly significant (F=112, df1=2, df2=7749, p<0.0001) but explains very little of the response variable (mult. R2=0.028, adj. R2=0.028). The coefficients indicate that reaction times are shortest at site a, longer at site b, and longest at site c, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. for SITE: c).
| | Estimate | 95%-CI | se | t | p2-tailed |
|:---------------------|:---------|:-----------------|:-----|:-------|:----------|
| Intercept (SITE=a) | 603.45 | [593.08, 613.82] | 5.29 | 114.04 | <0.001 |
| SITE a→b | 60.33 | [46.21, 74.46] | 7.20 | 8.38 | <0.001 |
| SITE a→c | 110.47 | [95.99, 124.95] | 7.39 | 14.96 | <0.001 |
A post-hoc comparison of the above model with one in which sites b and c were conflated indicates that sites b and c are also significantly different from each other (difference between means: 50.133, F=49.81/t=7.058, p<0.0001).
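If one also wanted a confidence interval for that b-vs-c difference in the write-up, it can be computed the same manual way as the other CIs in this session, from the estimate and standard error reported above (50.133 and 7.103) and the t-quantile at the model's residual df:

```r
est <- 50.133   # b-vs-c difference between means (see emmeans/glht output)
se  <- 7.103    # its standard error
est + c(-1, 1) * qt(0.975, df=7749) * se  # lower & upper bound of the 95%-CI
```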
---title: "Ling 105: session 02: linear regr. 1 (key)"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2024-04-08 12:34:56"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 5 fig-height: 5 fig-format: png fig-dpi: 300 fig-align: center embed-resources: trueexecute: cache: false echo: true eval: true warning: false---# IntroductionLet's load a data set for a few linear modeling examples; the data are in [_input/rts.csv](_input/rts.csv) and you can find information about the variables/columns in [_input/rts.r](_input/rts.r).```{r prepworkspace}rm(list=ls(all.names=TRUE))library(car); library(effects); library(magrittr)summary(d <-read.delim( # summarize d, the result of loadingfile="_input/rts.csv", # this filestringsAsFactors=TRUE)) # change categorical variables into factors```As in the book, we will first discuss some didactically motivated monofactorial models before we go over a 'proper' multifactorial modeling approach.# DevianceA useful thing to consider is **deviance**, which we will here define as the overall amount of variability in the response variable. 
As we talked about before in class, the response variable has a mean, but the response variable's values are not all the mean, they vary around it:```{r variability}summary(d$RT)```One way to express the degree to which the values of `RT` vary around the mean would be the **average absolute deviation**:```{r aveabsdev}mean(abs(d$RT -mean(d$RT, na.rm=TRUE)), na.rm=TRUE)```But as you know, this is not how we usually quantify deviations from the mean: As in the standard deviation, we usually square the differences between the observed values and the mean and then add them up to get the so-called **deviance**:```{r nulldeviance1}sum((d$RT -mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)```This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called **null model**, i.e. a model with no predictors:```{r nullmodel}summary(m_00 <-lm( # make/summarize the linear model m_00: RT ~1, # RT ~ an overall intercept (1) & no predictordata=d, # those variables are in dna.action=na.exclude)) # skip cases with NA/missing data (good habit)```That model's intercept is the overall mean of the response variable and that model's deviance is the same as what we computed manually just above:```{r nulldeviance2}deviance(m_00)```Note that this null model actually returns the same result as when you use a one-sample *t*-test on the reaction times; compare the mean and the *t*-value:```{r}t.test(d$RT)```And this is how we'd get the confidence intervals that `t.test` provides from the regression model:```{r}Confint(m_00)```And the manual computation would be this:```{r}# lower bound:coef(m_00)[1] +# the intercept plussummary(m_00)$coef[1,2] *# the intercept's seqt(0.025, df=7751) # the t-value for 2.5% at that df# upper bound:coef(m_00)[1] +# the intercept plussummary(m_00)$coef[1,2] *# the intercept's seqt(0.975, 
df=7751) # the t-value for 2.5% at that df```Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number ...# A numeric predictorDoes the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?## Exploration & preparationSome exploration of the response variable:```{r}hist(d$RT); hist(d$RT, breaks="FD")```Very long tail on the right ... Let's see where 'the middle 90%' of the data are, which might be useful as a good plotting range:```{r}quantile(d$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)```Ok, so the middle 90% of the data are in [417, 1137.45], which means that some values like that might make for good *y*-axis limits (e.g. for some plots). Now the predictor:```{r}hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")# the predictor(s) w/ the responseplot(main="RT in ms as a function of frequency per million words",pch=16, col="#00000030",xlab="Frequency pmw", xlim=c( 0, 1200), x=d$FREQPMW,ylab="RT in ms" , ylim=c(250, 4250), y=d$RT); grid()abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)```## Modeling & numerical interpretationHow would we have dealt with this in Ling 104? We would have computed a Pearson product-moment correlation coefficient *r* (or Spearman's ρ, either with `cor.test`) and maybe a simple linear model (with `lm`). Thus, we might have done something like this (for Pearson's *r*) ...```{r num_pearson}cor.test(d$RT, d$FREQPMW)```... 
or this for the simplest possible linear regression model:```{r num_lm1}summary(m_01 <-lm( # make/summarize the linear model m_01: RT ~1+# RT ~ an overall intercept (1) FREQPMW, # & the predictor FREQPMWdata=d, # those variables are in dna.action=na.exclude)) # skip cases with NA/missing data (good habit)```Note the equivalence of the classical Pearson's *r* and the linear model we fit:* the *t*-statistic from Pearson's *r* (`r cor.test(d$RT, d$FREQPMW)$statistic`) squared is `m_01`'s *F*-value (`r summary(m_01)[["fstatistic"]][1]`);* the *t*-statistic from Pearson's *r* (`r cor.test(d$RT, d$FREQPMW)$statistic`) is also the *t*-statistic of `FREQPMW` (`r summary(m_01)$coef[2,3]`);* the *df* of Pearson's *r* (`r cor.test(d$RT, d$FREQPMW)$parameter`) is also the *df*~2~ of `m_01` (`r summary(m_01)[["fstatistic"]][3]`);* the value of Pearson's *r* (`r cor.test(d$RT, d$FREQPMW)$estimate`) squared is `m_01`'s *Multiple R-squared* (`r summary(m_01)[["r.squared"]]`).But that correlation/model can't be 'right', given the descriptive/exploratory plots. Thus, we ask, does the reaction time to a word (in ms) vary as a function of a **logged version** of the frequency of that word? 
For that we use the already provided variable `ZIPFFREQ` so let's explore the new predictor:```{r}# the new predictor(s) on its/their ownhist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")# the predictor(s) w/ the responseplot(main="RT in ms as a function of frequency per million words",pch=16, col="#00000030",xlab="Zipf frequency", xlim=c( 3, 6), x=d$ZIPFFREQ,ylab="RT in ms" , ylim=c(250, 4250), y=d$RT); grid()abline(lm(d$RT ~ d$ZIPFFREQ), col="red", lwd=2)```Let's fit our new/second regression model:```{r }summary(m_02 <-lm( # make/summarize the linear model m_02: RT ~1+# RT ~ an overall intercept (1) ZIPFFREQ, # & the predictor ZIPFFREQ (which involves logging)data=d, # those vars are in dna.action=na.exclude)) # skip cases with NA/missing data (good habit)```What are the confidence intervals of the coefficients?```{r num_confint1}Confint(m_02)```And the manual computation would be this:```{r num_confint2}# lower bound for the intercept:coef(m_02)[1] +# the intercept plussummary(m_02)$coef[1,2] *# the intercept's seqt(0.025, df=7750) # the t-value for 2.5% at that df# upper bound for the intercept:coef(m_02)[1] +# the intercept plussummary(m_02)$coef[1,2] *# the intercept's seqt(0.975, df=7750) # the t-value for 2.5% at that df# lower bound for the slope:coef(m_02)[2] +# the slope plussummary(m_02)$coef[2,2] *# the slope's seqt(0.025, df=7750) # the t-value for 2.5% at that df# upper bound for the slope:coef(m_02)[2] +# the slope plussummary(m_02)$coef[2,2] *# the slope's seqt(0.975, df=7750) # the t-value for 2.5% at that df```What is the deviance of our model with its one predictor?```{r}deviance(m_02)```Wow, that doesn't seem to have reduced the deviance much from that of the null model, which was `r deviance(m_00)`. 
How much is the reduction in %?```{r}"/"(deviance(m_00)-deviance(m_02),deviance(m_00))```Does that number look familiar?Let's now also add the predictions of the model to the original data frame (usually a good (house-keeping) habit to adopt):```{r num_preds}d$PRED <-# make d$PRED thepredict( # predictions for all cases m_01) # based on m_01head(d) # check the result```## Visual interpretationLet's visualize the nature of the effect of `ZIPFFREQ` based on the predictions from `effects`:```{r}(zpf_d <-data.frame( # make zpf_d a data frame of zpf <-effect( # the effect"ZIPFFREQ", # of ZIPFFREQ m_02))) # in m_02plot(zpf, # plot the effect of ZIPFFREQ in m_02xlab="Zipf frequency", # w/ this x-axis labelylab="RT in ms", # w/ this y-axis labelylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limitsgrid=TRUE) # & w/ a gridplot(zpf, # plot the effect of ZIPFFREQxlab="Zipf frequency", # w/ this x-axis labelylab="RT in ms (middle 90%)", # w/ this y-axis labelylim=c(417, 1138), # w/ these y-axis limits (middle 90%)grid=TRUE) # & w/ a grid```And here a version with everything: observed values, predicted values, and the confidence band of the predictions/regression line:```{r}#| fig-width: 10#| fig.show: holdpar(mfrow=c(1,2))plot(main="RT in ms as a function of Zipf frequency",pch=16, col="#00000030",xlab="Zipf frequency", xlim=c( 3, 6), x=d$ZIPFFREQ,ylab="RT in ms" , ylim=c(250, 4250), y=d$RT); grid()# add the regression line:abline(m_02, col="red", lwd=2)# add the confidence band:polygon(c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),c(zpf_d$lower, rev(zpf_d$upper)),col="#FF000030", border=NA)# indicate the middle 90%:abline(h=c(417, 1138), lty=3, col="blue")plot(main="RT in ms as a function of Zipf frequency",frame.plot=FALSE, pch=16, col="#00000030",xlab="Zipf frequency" , xlim=c( 3, 6), x=d$ZIPFFREQ,ylab="RT in ms (middle 90%)", ylim=c(417, 1138), y=d$RT)grid(); box(col="blue", lty=3)# add the regression line:abline(m_02, col="red", lwd=2)# add the confidence 
band:polygon(col="#FF000030", border=NA,c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),c(zpf_d$lower, rev(zpf_d$upper)))par(mfrow=c(1,1))```## Write-upTo determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words), a linear model was fit with `RT` as the response variable and `FREQPMW` as a predictor. Exploration of all variables involved indicated the skewed distribution of `FREQPMW`, which is why a log-transformed version of this predictor was used in its place, `ZIPFFREQ`. The model was highly significant (*F*=133.3, *df*~1~=1, *df*~2~=7750, *p*<0.0001) but explains very little of the response variable (mult. *R*^2^=0.017, adj. *R*^2^=0.017). The coefficients indicate that `ZIPFFREQ` is negatively correlated with `RT`, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.| | Estimate | 95%-CI | *se* | *t* | *p*~2-tailed~ ||:-----------------------|:---------|:-----------------|:------|:-------|:--------------|| Intercept (ZIPFFREQ=0) | 949.36 | [900.13, 998.59] | 25.11 | 37.8 | <0.001 || ZIPFFREQ~0→1~ | -62.46 | [-73.06, -51.06] | 5.41 | -11.54 | <0.001 |```{r}# housekeeping: remove the predictionsstr(d <- d[,1:9])```# A binary predictorDoes the reaction time to a word (in ms) vary as a function of which language that word was presented in (*English* vs. 
*Spanish*)?## Exploration & preparationSome exploration of the relevant variables:```{r}# the predictor(s)/response on its/their ownhist(d$RT); hist(d$RT, breaks="FD")table(d$LANGUAGE)# the predictor(s) w/ the responseqwe <-boxplot(main="RT in ms as a function of stimulus language", d$RT ~ d$LANGUAGE, notch=TRUE,xlab="Stimulus language",ylab="RT in ms", ylim=c(250, 4250)); grid()```There are many outliers, let's plot this again without them:```{r}qwe$stats# the predictor(s) w/ the responseboxplot(main="RT in ms as a function of stimulus language", d$RT ~ d$LANGUAGE, notch=TRUE, outline=FALSE,xlab="Stimulus language",ylab="RT in ms (no 'outliers')"); grid()```## Modeling & numerical interpretationHow would we have dealt with this in Ling 104?Thus, we might have done something like this ...```{r bin_ttest}t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE) # the default t-test (Student)```This *t*-test is actually also just a very simple linear regression model:```{r bin_lm1}summary(m_01 <-lm( # make/summarize the linear model m_01: RT ~1+# RT ~ an overall intercept (1) LANGUAGE, # & the predictor LANGUAGEdata=d, # those vars are in dna.action=na.exclude)) # skip cases with NA/missing data (good habit)```Note the equivalence of the classical *t*-test and the linear model we fit (with the exception of the direction of the differences):* the *t*-statistic from the *t*-test (`r t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$statistic`) squared is `m_01`'s *F*-value (`r summary(m_01)[["fstatistic"]][1]`);* the *t*-statistic from the *t*-test (`r t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$statistic`) is also the (negative of the) *t*-statistic of `LANGUAGEspa` (`r summary(m_01)$coef[2,3]`);* the *df* of the *t*-test (`r t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$parameter`) is also the *df*~2~ of `m_01` (`r summary(m_01)[["fstatistic"]][3]`);* the first mean reported by the *t*-test (*mean in group eng*, of `r t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$estimate[1]`) is the intercept of 
`m_01` (of `r coef(m_01)[1]`);* the second mean reported by the *t*-test (*mean in group spa*, of `r t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$estimate[2]`) is the sum of + the intercept of `m_01` (of `r coef(m_01)[1]`) and + the mean difference/slope of `m_01` (of `r coef(m_01)[2]`);* the confidence interval of the *t*-test (`r as.numeric(t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE)$conf.int)`) is the (negative) coefficient for `LANGUAGEspa` (of `r coef(m_01)[2]`) plus/minus the product of + its standard error (of `r summary(m_01)$coef[2,2]`) and + the *t*-value that cuts off 2.5% of the area under the curve for *df*=7750 (`qt(p=0.025, df=7750)`, i.e. `r qt(p=0.025, df=7750)`).The CI computation for the linear model is shown here, first manually again and then the easy way with `Confint`:```{r bin_confint}# lower bound for the intercept:coef(m_01)[1] +# the intercept plussummary(m_01)$coef[1,2] *# the intercept's seqt(0.025, df=7750) # the t-value for 2.5% at that df# upper bound for the intercept:coef(m_01)[1] +# the intercept plussummary(m_01)$coef[1,2] *# the intercept's seqt(0.975, df=7750) # the t-value for 2.5% at that df# lower bound for the difference/slope:coef(m_01)[2] +# the difference/slope plussummary(m_01)$coef[2,2] *# the difference/slope's seqt(0.025, df=7750) # the t-value for 2.5% at that df# upper bound for the difference/slope:coef(m_01)[2] +# the difference/slope plussummary(m_01)$coef[2,2] *# the difference/slope's seqt(0.975, df=7750) # the t-value for 2.5% at that dfConfint(m_01)```Back to the model: What is the deviance of our model with its one predictor?```{r}deviance(m_01)```This seems to have reduced the deviance much from that of the null model, which was `r deviance(m_00)` -- again, how much is the reduction in %?```{r}"/"(deviance(m_00)-deviance(m_01),deviance(m_00))```Which I am again hoping is familiar by now ...We again add the predictions of the model to the original data frame:```{r}d$PRED <-# make d$PRED thepredict( # predictions 
for all cases m_01) # based on m_01head(d, 3) # check ...tail(d, 3) # the result```## Visual interpretationLet's visualize the nature of the effect of `LANGUAGE` based on the predictions from `effects`:```{r}(lan_d <-data.frame( # make lan_d a data frame of lan <-effect( # the effect"LANGUAGE", # of LANGUAGE m_01))) # in m_01plot(lan, # plot the effect of LANGUAGE in m_01xlab="Language", # w/ this x-axis labelylab="RT in ms", # w/ this y-axis labelylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limitsgrid=TRUE) # & w/ a gridplot(lan, # plot the effect of LANGUAGExlab="Language", # w/ this x-axis labelylab="RT in ms (middle 90%)", # w/ this y-axis labelylim=c(417, 1138), # w/ these y-axis limits (middle 90%)grid=TRUE) # & w/ a grid```And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:```{r}#| fig-width: 10#| fig-show: holdpar(mfrow=c(1,2))stripchart(main="RT ~ LANGUAGE", # stripchart w/ main headingpch=16, col="#00000020", # filled semi-transparent grey circles d$RT ~ d$LANGUAGE, # data: RT ~ LANGUAGExlab="Language", ylab="RT in ms", # axis labelsvertical=TRUE, # make the chart have a vertical y-axismethod="jitter") # jitter observations# add horizontal grid lines at deciles & indicate the middle 90%:abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")abline(h=c(417, 1138), lty=3, col="blue")points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &for (i inseq(lan_d$fit)) { # their confidence intervalsarrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")}stripchart(main="RT ~ LANGUAGE", # stripchart w/ main headingpch=16, col="#00000020", # filled semi-transparent grey circles d$RT ~ d$LANGUAGE, # data: RT ~ LANGUAGExlab="Language", ylab="RT in ms (middle 90%)", # axis labelsylim=c(417, 1138), # y-axis limits: middle 90%vertical=TRUE, # make the chart have a vertical y-axismethod="jitter", # jitter observationsframe.plot=FALSE) # 
suppress boxbox(col="blue", lty=3)# add horizontal grid lines at deciles:abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &for (i inseq(lan_d$fit)) { # their confidence intervalsarrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")}par(mfrow=c(1,1))```## Write-upTo determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (*english* vs. *spanish*), a linear model was fit with `RT` as the response variable and `LANGUAGE` as a predictor. The model was highly significant (*F*=559.8, *df*~1~=1, *df*~2~=7750, *p*<0.0001) but explains very little of the response variable (mult. *R*^2^=0.067, adj. *R*^2^=0.067). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.| | Estimate | 95%-CI | *se* | *t* | *p*~2-tailed~ ||:-----------------------------------|:---------|:-----------------|:-----|:-------|:--------------|| Intercept (LANGUAGE=*english*) | 594.94 | [587.05, 602.84] | 4.03 | 147.66 | <0.001 || LANGUAGE~*english*→*spanish*~ | 136.61 | [125.29, 147.93] | 5.77 | 23.66 | <0.001 |```{r}# housekeeping: remove the predictionsstr(d <- d[,1:9])```# A categorical predictorDoes the reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (*a* vs. *b* vs. 
*c*)?## Exploration & preparationSome exploration of the relevant variables:```{r}# the predictor(s)/response on its/their ownhist(d$RT); hist(d$RT, breaks="FD")table(d$SITE)``````{r}#| fig-show: hold# the predictor(s) w/ the responsepar(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columnsqwe <-boxplot(main="RT in ms ~ experimental site", d$RT ~ d$SITE, notch=TRUE,xlab="Experimental site",ylab="RT in ms" , ylim=c(250, 4250)); grid()boxplot(main="RT in ms ~ experimental site", d$RT ~ d$SITE, notch=TRUE, outline=FALSE,xlab="Experimental site",ylab="RT in ms (no 'outliers')"); grid()par(mfrow=c(1, 1)) # reset to default```It is noteworthy that site *c* differs considerably from the others in the outlier distribution; this would definitely require some attention!## Modeling & numerical interpretationLet's fit our regression model:```{r}summary(m_01 <-lm( # make/summarize the linear model m_01: RT ~1+# RT ~ an overall intercept (1) SITE, # & the predictor SITEdata=d, # those vars are in dna.action=na.exclude)) # skip cases with NA/missing data (good habit)```The CI computation for the linear model is shown here, first manually again and then the easy way with `Confint`:```{r cat_confint}# lower bound for the intercept:coef(m_01)[1] +# the intercept plussummary(m_01)$coef[1,2] *# the intercept's seqt(0.025, df=7749) # the t-value for 2.5% at that df# upper bound for the intercept:coef(m_01)[1] +# the intercept plussummary(m_01)$coef[1,2] *# the intercept's seqt(0.975, df=7749) # the t-value for 2.5% at that df# lower bound for the difference/slope for SITE b (rel. to a):coef(m_01)[2] +# the difference/slope plussummary(m_01)$coef[2,2] *# the difference/slope's seqt(0.025, df=7749) # the t-value for 2.5% at that df# upper bound for the difference/slope for SITE b (rel. 
to a):coef(m_01)[2] +# the difference/slope plussummary(m_01)$coef[2,2] *# the difference/slope's seqt(0.975, df=7749) # the t-value for 2.5% at that df# lower bound for the difference/slope for SITE c (rel. to a):coef(m_01)[3] +# the difference/slope plussummary(m_01)$coef[3,2] *# the difference/slope's seqt(0.025, df=7749) # the t-value for 2.5% at that df# upper bound for the difference/slope for SITE c (rel. to a):coef(m_01)[3] +# the difference/slope plussummary(m_01)$coef[3,2] *# the difference/slope's seqt(0.975, df=7749) # the t-value for 2.5% at that dfConfint(m_01)```What is the deviance of our model with its one predictor (but 2 contrasts from the intercept)?```{r}deviance(m_01)```Again pretty bad -- how much is the reduction in % relative to the null model?```{r}"/"(deviance(m_00)-deviance(m_01),deviance(m_00))```As usual, we add the predictions of the model to the original data frame:```{r}d$PRED <-# make d$PRED thepredict( # predictions for all cases m_01) # based on m_01head(d, 3) # check ...tail(d, 3) # the result```## Visual interpretationLet's visualize the nature of the effect of `SITE` based on the predictions from `effects`:```{r}(sit_d <-data.frame( # make sit_d a data frame of sit <-effect ( # the effect"SITE", # of SITE m_01))) # in m_01plot(sit, # plot the effect of SITE in m_01xlab="Site", # w/ this x-axis labelylab="RT in ms", # w/ this y-axis labelylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limitsgrid=TRUE) # & w/ a gridplot(sit, # plot the effect of SITExlab="Site", # w/ this x-axis labelylab="RT in ms (middle 90%)", # w/ this y-axis labelylim=c(417, 1138), # w/ these y-axis limits (middle 90%)grid=TRUE) # & w/ a grid```And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:```{r}#| fig-width: 10#| fig-show: holdpar(mfrow=c(1,2))stripchart(main="RT ~ SITE", # stripchart w/ main headingpch=16, col="#00000020", # filled semi-transparent grey circles d$RT ~ d$SITE, # data: 
RT ~ SITExlab="RT in ms", ylab="Site", # axis labelsvertical=TRUE, # make the chart have a vertical y-axismethod="jitter") # jitter observations# add horizontal grid lines at deciles & indicate the middle 90%:abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")abline(h=c(417, 1138), lty=3, col="blue")points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &for (i inseq(sit_d$fit)) { # their confidence intervalsarrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")}stripchart(main="RT ~ SITE", # stripchart w/ main headingpch=16, col="#00000020", # filled semi-transparent grey circles d$RT ~ d$SITE, # data: RT ~ SITExlab="RT in ms", ylab="Site", # axis labelsylim=c(417, 1138), # y-axis limits: middle 90%vertical=TRUE, # make the chart have a vertical y-axismethod="jitter", # jitter observationsframe.plot=FALSE) # suppress boxbox(col="blue", lty=3)# add horizontal grid lines at deciles & indicate the middle 90%:abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &for (i inseq(sit_d$fit)) { # their confidence intervalsarrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")}par(mfrow=c(1,1))```Excursus: This is an interesting case because only the very last plot shows something that we haven't noticed before, namely the fact that both sites *a* and *c* have a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602]); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:```{r}plot(main="RT in ms ~ experimental site", type="n",xlab="RT in ms" , xlim=c(7.75, 12.25), x=0,ylab="Cumulative proportion", ylim=c(0 , 1 ), y=0); grid()for (i in1:3) { # for each of the 3 siteslines(pch=".", # plot lines w/ this point characterecdf(log2(d$RT[d$SITE==levels(d$SITE)[i]])), # namely ecdf linescol=rainbow(3)[i]) 
# using these colors}legend(title="Site", # add a legend with the title "Site"x=10.5, xjust=0.5, # at these x-coordinates (centered)y=0.5, yjust=0.5, # at these y-coordinates (centered)legend=levels(d$SITE), # these are the labels, show them ...fill=rainbow(3), # w/ these colorsncol=3, bty="n") # organize the legend in 3 columns, no box```## Excursus: model comparisonIn cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for `SITE` very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let's discuss briefly how this can be tested with a method called **model comparison**: If you have two models that were fit to the same data set and where one model is a sub-model of the other model -- typically such that they differ only by one predictor/term -- then you can use the function `anova` to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of `SITE` are already in the summary table:```{r}summary(m_01)$coefficients```From this, we know that `SITE`: *a* is significantly different from `SITE`: *b* and `SITE`: *c*, but we don't know whether the 110.46690-60.33422=50.13268 difference between `SITE`: *b* and `SITE`: *c* is significant as well. To find that out, we can compare `m_01` to a model in which `SITE`: *b* and `SITE`: *c* are not distinguished. 
For that, we first, we create a version of `SITE` that conflates levels as desired:```{r}d$SITE_confl <-d$SITE # create a copy of SITElevels(d$SITE_confl) <-# overwrite the copy's levels such thatc("a", # "a" remains "a""b/c", "b/c") # "b" & "c" become "b/c"table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it righthead(d)```We already have the model with `SITE` as the only predictor so we only need to fit a new model with `SITE_confl` as the only predictor:```{r}summary(m_01_siteconfl <-lm( # make/summarize the linear model m_01_siteconfl: RT ~1+ SITE_confl, # RT ~ an overall intercept (1) & SITE_confldata=d, na.action=na.exclude)) # the other arguments```... and then we do a comparison with `anova`:```{r}anova(m_01, # compare the model w/ the 3-level predictor SITE m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confltest="F") # using an F-test```The difference between the two models is hugely significant: The finer resolution of `SITE` compared to `SITE_confl` makes the model significantly better (*p*<0.00001). Thus, we now know not only the size of the 3rd comparison (`SITE`: *b* vs. *c*, which was clear from `summary(m_01)`), but also its *p*-value.Another way of getting such results is available from `emmeans::emmeans`, as you saw in the book. 
Here,* lines 2 and 3 of the model's summary table correspond to lines 1 and 2 of the `emmeans` output;* line 3 of the `emmeans` output corresponds to the result we obtained with the model conflation approach above: the *F*-value of the model conflation approach is the squared *t*-value of the `emmeans` approach:```{r}summary(m_01)$coefficientspairs( # compute pairwise comparisons emmeans::emmeans( # of means computedobject=m_01, # for the model m_01~ SITE), # namely all contrasts of this predictoradjust="none") # don't adjust for 3 comparisons (careful w/ this)```Finally, a very flexible, powerful, and quite complicated way of getting all sorts of such results here is also available from `multcomp::glht`. (I only show this here as an appetizer, but a discussion of cool things `glht` can do is offered in other courses):```{r}c(0, 1, -1) %>%matrix(nrow=1) %>% multcomp::glht(m_01, .) %>% summary```## Write-upTo determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (*a* vs. *b* vs. *c*), a linear model was fit with `RT` as the response variable and `SITE` as a predictor. The model was highly significant (*F*=112, *df*~1~=2, *df*~2~=7749, *p*<0.0001) but explains very little of the response variable (mult. *R*^2^=0.028, adj. *R*^2^=0.028). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. 
for `SITE`: *c*).| | Estimate | 95%-CI | *se* | *t* | *p*~2-tailed~ ||:---------------------|:---------|:-----------------|:-----|:-------|:--------------|| Intercept (SITE=*a*) | 603.45 | [593.08, 613.82] | 5.29 | 114.04 | <0.001 || SITE~*a*→*b*~ | 60.33 | [46.21, 74.46] | 7.2 | 8.38 | <0.001 || SITE~*a*→*c*~ | 110.47 | [95.99, 124.95] | 7.39 | 14.96 | <0.001 |A post-hoc comparison of the above model with one in which sites *b* and *c* were conflated indicates that sites *b* and *c* are also significantly different from each other (difference between means: 50.133, *F*=49.81/*t*=7.058, *p*<0.0001).```{r sessionInfo}sessionInfo()```