Let’s load a data set for a few linear modeling examples:
rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE RT LENGTH LANGUAGE GROUP
## Min. : 1 Min. : 271.0 Min. :3.000 english:4023 english :3961
## 1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spanish:3977 heritage:4039
## Median :4310 Median : 595.0 Median :5.000
## Mean :4303 Mean : 661.5 Mean :5.198
## 3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
## Max. :8610 Max. :4130.0 Max. :9.000
## NA's :248
## CORRECT FREQPMW ZIPFFREQ SITE
## correct :7749 Min. : 1.00 Min. :3.000 a:2403
## incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
## Median : 42.00 Median :4.623 c:2782
## Mean : 81.14 Mean :4.591
## 3rd Qu.: 101.00 3rd Qu.:5.004
## Max. :1152.00 Max. :6.061
##
Here’s what the columns are/contain:

CASE
is a simple case/row number just so that each case has a unique ID;
RT
is the response variable: it’s the reaction time in ms that indicates how long a speaker needed to decide whether the word displayed on a computer screen was a real word or not;
LENGTH
is a numeric predictor: it’s the length of the stimulus word displayed on the computer screen in characters; usually, longer words need longer reaction times;
LANGUAGE
is a binary predictor: it’s the language of the stimulus word displayed on the computer screen (english vs. spanish);
GROUP
is a binary predictor: it’s whether the speaker to whom the stimulus word was presented was a learner of Spanish (english) or a heritage speaker of Spanish (heritage);
CORRECT
is a binary predictor or control: it’s whether the speaker correctly classified the stimulus word displayed on the computer screen as a word or a non-word (correct vs. incorrect);
FREQPMW
is a numeric predictor or control: it’s the frequency of the stimulus word per million words in a large reference corpus;
ZIPFFREQ
is a numeric predictor or control: it’s the transformed frequency of the stimulus word in a large reference corpus (the transformation included a log to the base of 10);
SITE
is a ternary predictor or control: it’s at which of three locations speakers participated in the experiment (a vs. b vs. c).

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but the response variable’s values are not all the mean, they vary around it:
summary(d$RT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 271.0 505.0 595.0 661.5 732.0 4130.0 248
One way to express the degree to which the values of RT
vary around the mean would be the average absolute deviation:
mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
## [1] 172.5416
But as you know, this is not how we usually quantify deviations from the mean: as with the standard deviation, we square the differences between the observed values and the mean and then add them up to get the so-called deviance:
sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
## [1] 536471586
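As a quick sanity check (using a small made-up vector here, not the actual RT data), the deviance computed this way is simply the sample variance multiplied by n-1:

```r
# a small made-up vector (not the actual RT data):
x <- c(271, 505, 595, 732, 4130)
# the deviance: the sum of squared deviations from the mean
dev.x <- sum((x - mean(x))^2)
# the same number via the sample variance: var(x) * (n-1)
dev.x.alt <- var(x) * (length(x) - 1)
all.equal(dev.x, dev.x.alt)
## [1] TRUE
```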
This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:
summary(m.00 <- lm( # make/summarize the linear model m.00:
RT ~ 1, # RT ~ an overall intercept (1) & no predictors
data=d, # those variables are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
##
## Call:
## lm(formula = RT ~ 1, data = d, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.5 -156.5 -66.5 70.5 3468.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.469 2.988 221.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.1 on 7751 degrees of freedom
## (248 observations deleted due to missingness)
That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:
deviance(m.00)
## [1] 536471586
Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:
t.test(d$RT)
##
## One Sample t-test
##
## data: d$RT
## t = 221.37, df = 7751, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 655.6111 667.3259
## sample estimates:
## mean of x
## 661.4685
And this is how we’d get the confidence intervals that
t.test
provides from the regression model:
Confint(m.00)
## Estimate 2.5 % 97.5 %
## (Intercept) 661.4685 655.6111 667.3259
And the manual computation would be this:
# lower bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.025, df=7751) # the t-value for 2.5% at that df
## (Intercept)
## 655.6111
# upper bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.975, df=7751) # the t-value for 97.5% at that df
## (Intercept)
## 667.3259
Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
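To preview what ‘reducing the deviance’ means (again with made-up data, not the actual data set): a model containing a predictor that is related to the response will have a smaller deviance than the null model:

```r
# made-up data: a response that partly depends on a predictor
set.seed(1)
x <- runif(200, 0, 100)
y <- 500 + 2*x + rnorm(200, sd=50)
m.null <- lm(y ~ 1) # the null model: its deviance = the overall variability
m.pred <- lm(y ~ x) # a model with the predictor
deviance(m.pred) < deviance(m.null) # the predictor reduces the deviance
## [1] TRUE
```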
Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")
Very long tail on the right … Let’s see where ‘the middle 90%’ of the data are, which might be useful as a good plotting range:
quantile(d$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
## 271.00 417.00 446.00 469.65 487.00 505.00 522.00 538.00 556.00 576.00
## 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
## 595.00 614.00 640.00 665.00 695.00 732.00 777.00 834.00 929.00 1137.45
## 100%
## 4130.00
Ok, so the middle 90% of the data are in [417, 1137.45], which means that some values like that might make for good y-axis limits (e.g. for some plots). Now the predictor:
hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")
# the predictor(s) w/ the response
plot(
main="RT in ms as a function of frequency per million words",
pch=16, col="#00000030",
xlab="Frequency pmw", xlim=c( 0, 1200), x=d$FREQPMW,
ylab="RT in ms" , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)