rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(x <- read.delim( # summarize x, the result of loading
file="105_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE RT LENGTH LANGUAGE GROUP
## Min. : 1 Min. : 271.0 Min. :3.000 eng:4023 english :3961
## 1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spa:3977 heritage:4039
## Median :4310 Median : 595.0 Median :5.000
## Mean :4303 Mean : 661.5 Mean :5.198
## 3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
## Max. :8610 Max. :4130.0 Max. :9.000
## NA's :248
## CORRECT FREQPMW ZIPFFREQ SITE
## correct :7749 Min. : 1.00 Min. :3.000 a:2403
## incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
## Median : 42.00 Median :4.623 c:2782
## Mean : 81.14 Mean :4.591
## 3rd Qu.: 101.00 3rd Qu.:5.004
## Max. :1152.00 Max. :6.061
##
A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but its values are of course not all identical to that mean; they vary around it:
summary(x$RT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 271.0 505.0 595.0 661.5 732.0 4130.0 248
One way to express the degree to which the values of RT
vary around the mean would be the average absolute deviation:
mean(abs(x$RT - mean(x$RT, na.rm=TRUE)), na.rm=TRUE)
## [1] 172.5416
But as you know, this is not how we usually quantify deviations from the mean. As with the standard deviation, we usually square the differences between the observed values and the mean and then add them up, which gives us the deviance:
sum((x$RT - mean(x$RT, na.rm=TRUE))^2, na.rm=TRUE)
## [1] 536471586
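(A quick cross-check, not something you would normally need: since the sample variance is the sum of squared deviations divided by n-1, this deviance is just the variance of RT multiplied by its degrees of freedom.)
var(x$RT, na.rm=TRUE) *  # the variance of RT times
   (sum(!is.na(x$RT))-1) # its degrees of freedom (n-1)
## [1] 536471586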
This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:
summary(m.00 <- lm( # make/summarize the linear model m.00:
RT ~ 1, # RT ~ an overall intercept (1) & no predictor
data=x, # those variables are in x
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
##
## Call:
## lm(formula = RT ~ 1, data = x, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.5 -156.5 -66.5 70.5 3468.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.469 2.988 221.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.1 on 7751 degrees of freedom
## (248 observations deleted due to missingness)
That model’s deviance is the same as what we computed manually just above:
deviance(m.00)
## [1] 536471586
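In fact, for a linear model the deviance is nothing but the sum of the model's squared residuals, which we can verify manually (the na.rm=TRUE is needed because na.exclude pads the residuals with NAs for the missing cases):
sum(residuals(m.00)^2, # sum of the null model's squared residuals
   na.rm=TRUE)         # ignore the NA-padded cases
## [1] 536471586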
Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times:
t.test(x$RT)
##
## One Sample t-test
##
## data: x$RT
## t = 221.37, df = 7751, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 655.6111 667.3259
## sample estimates:
## mean of x
## 661.4685
And this is how we’d get the confidence interval for the intercept (or, more generally, for any coefficient):
Confint(m.00)
## Estimate 2.5 % 97.5 %
## (Intercept) 661.4685 655.6111 667.3259
And the manual computation would be this:
# lower bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.025, df=7751) # the t-value for 2.5% at that df
## (Intercept)
## 655.6111
# upper bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.975, df=7751) # the t-value for 97.5% at that df
## (Intercept)
## 667.3259
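(Side note: for linear models, car's Confint essentially wraps base R's confint and just adds the estimates, so the base function should return the same bounds:)
confint(m.00)
##                2.5 %   97.5 %
## (Intercept) 655.6111 667.3259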
Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
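To preview how we will quantify such a reduction, here is a minimal sketch (the name m.sketch and the choice of FREQPMW as the predictor are just placeholders here; the actual modeling comes later): we fit a model with the predictor and compare its deviance to that of the null model, and the proportional reduction is what lm() reports as multiple R-squared.
m.sketch <- lm(RT ~ FREQPMW, data=x, na.action=na.exclude) # a model w/ one predictor
deviance(m.sketch)                    # the deviance left over after using FREQPMW
1 - deviance(m.sketch)/deviance(m.00) # the proportion of deviance 'explained' (= multiple R-squared)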
Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(x$RT); hist(x$RT, breaks="FD")
Very long tail on the right … Let’s see where ‘the middle 90%’ of the data are, which might be useful as a good plotting range:
quantile(x$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
## 271.00 417.00 446.00 469.65 487.00 505.00 522.00 538.00 556.00 576.00
## 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
## 595.00 614.00 640.00 665.00 695.00 732.00 777.00 834.00 929.00 1137.45
## 100%
## 4130.00
Ok, so the middle 90% of the data are in [417, 1137.45], which means that something like c(400, 1150) might make for good y-axis limits (at least for regression predictions). Now the predictor:
hist(x$FREQPMW); hist(x$FREQPMW, breaks="FD")
# the predictor(s) w/ the response
plot(
main="RT in ms as a function of frequency per million words",
pch=16, col="#00000030",
xlab="Frequency pmw", xlim=c(0, 1200) , x=x$FREQPMW,
ylab="RT in ms" , ylim=c(250, 4250), y=x$RT); grid()
abline(lm(x$RT ~ x$FREQPMW), col="red", lwd=2)