```
rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(x <- read.delim( # summarize x, the result of loading
file="105_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
```

```
## CASE RT LENGTH LANGUAGE GROUP
## Min. : 1 Min. : 271.0 Min. :3.000 eng:4023 english :3961
## 1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spa:3977 heritage:4039
## Median :4310 Median : 595.0 Median :5.000
## Mean :4303 Mean : 661.5 Mean :5.198
## 3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
## Max. :8610 Max. :4130.0 Max. :9.000
## NA's :248
## CORRECT FREQPMW ZIPFFREQ SITE
## correct :7749 Min. : 1.00 Min. :3.000 a:2403
## incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
## Median : 42.00 Median :4.623 c:2782
## Mean : 81.14 Mean :4.591
## 3rd Qu.: 101.00 3rd Qu.:5.004
## Max. :1152.00 Max. :6.061
##
```

A useful thing to consider is **deviance**, which we
will here define as the overall amount of variability in the response
variable. As we talked about before in class, the response variable has
a mean, but the response variable's values are not all the mean; they
vary around it:

`summary(x$RT)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 271.0 505.0 595.0 661.5 732.0 4130.0 248
```

One way to express the degree to which the values of `RT`
vary around the mean would be the average absolute deviation:

`mean(abs(x$RT - mean(x$RT, na.rm=TRUE)), na.rm=TRUE)`

`## [1] 172.5416`

But as you know, this is not how we usually quantify deviations from the mean. As with the standard deviation, we square the differences between the observed values and the mean and then add them up; the result is the deviance:

`sum((x$RT - mean(x$RT, na.rm=TRUE))^2, na.rm=TRUE)`

`## [1] 536471586`
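As a quick cross-check (using only base R), the same number is just the sample variance of `RT` scaled back up by *n*-1:

```
n <- sum(!is.na(x$RT)) # n: the number of non-missing RTs
var(x$RT, na.rm=TRUE) * (n-1) # variance * (n-1) = the deviance from above
```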

This number quantifies how much all data points together deviate from
the mean. Another useful way to look at this is to already use a
linear model to quantify this variability of the values around the mean. We
can do this by fitting a so-called **null model**, i.e. a
model with no predictors:

```
summary(m.00 <- lm( # make/summarize the linear model m.00:
RT ~ 1, # RT ~ an overall intercept (1) & no predictor
data=x, # those variables are in x
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
```

```
##
## Call:
## lm(formula = RT ~ 1, data = x, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.5 -156.5 -66.5 70.5 3468.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.469 2.988 221.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.1 on 7751 degrees of freedom
## (248 observations deleted due to missingness)
```

That model's deviance is the same as what we computed manually just above:

`deviance(m.00)`

`## [1] 536471586`
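This deviance is also what the residual standard error in the above summary is based on: dividing it by the residual degrees of freedom and taking the square root recovers the 263.1 reported there (both functions are base R):

```
sqrt(deviance(m.00) / # the deviance divided by
df.residual(m.00)) # the residual df (7751): approximately 263.1
```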

Note that this null model actually returns the same result as when
you use a one-sample *t*-test on the reaction times:

`t.test(x$RT)`

```
##
## One Sample t-test
##
## data: x$RT
## t = 221.37, df = 7751, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 655.6111 667.3259
## sample estimates:
## mean of x
## 661.4685
```
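To make this equivalence explicit, we can pull the intercept's *t*-value straight out of the model summary and compare it to the *t*-statistic of the test (both accessors are base R):

```
summary(m.00)$coefficients[1,3] # the intercept's t-value: 221.37
t.test(x$RT)$statistic # the one-sample t-statistic: the same 221.37
```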

And this is how we'd get the confidence intervals for the intercept or coefficients more generally:

`Confint(m.00)`

```
## Estimate 2.5 % 97.5 %
## (Intercept) 661.4685 655.6111 667.3259
```

And the manual computation would be this:

```
# lower bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.025, df=7751) # the t-value for 2.5% at that df
```

```
## (Intercept)
## 655.6111
```

```
# upper bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.975, df=7751) # the t-value for 97.5% at that df
```

```
## (Intercept)
## 667.3259
```
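Since `qt` is vectorized, both bounds can also be computed in one go (the same logic as above, just compacted):

```
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se times
qt(c(0.025, 0.975), df=7751) # the t-values for 2.5% & 97.5% at that df
```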

Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …

Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?

Some exploration of the relevant variables:

```
# the predictor(s)/response on its/their own
hist(x$RT); hist(x$RT, breaks="FD")
```

Very long tail on the right … Let's see where 'the middle 90%' of the data are, which might be useful as a good plotting range:

`quantile(x$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)`

```
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
## 271.00 417.00 446.00 469.65 487.00 505.00 522.00 538.00 556.00 576.00
## 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
## 595.00 614.00 640.00 665.00 695.00 732.00 777.00 834.00 929.00 1137.45
## 100%
## 4130.00
```
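If one only needs the two cut-off points, they can also be requested directly:

```
quantile(x$RT, # the boundaries of 'the middle 90%' of RT:
probs=c(0.05, 0.95), # the 5th & 95th percentile
na.rm=TRUE) # ignoring the 248 missing values
```

```
## 5% 95%
## 417.00 1137.45
```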

Ok, so the middle 90% of the data are in [417, 1137.45], which means
that something like `c(400, 1150)` might make for good
*y*-axis limits (at least for regression predictions). Now the
predictor:

`hist(x$FREQPMW); hist(x$FREQPMW, breaks="FD")`

```
# the predictor(s) w/ the response
plot(
main="RT in ms as a function of frequency per million words",
pch=16, col="#00000030",
xlab="Frequency pmw", xlim=c(0, 1200) , x=x$FREQPMW,
ylab="RT in ms" , ylim=c(250, 4250), y=x$RT); grid()
abline(lm(x$RT ~ x$FREQPMW), col="red", lwd=2)
```
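Finally, to preview where this is heading: a minimal sketch of the kind of model this exploration prepares for, and of how one would check by how much it reduces the null model's deviance. The name `m.01` and this exact specification are an assumption at this point, not the worked-out analysis:

```
summary(m.01 <- lm( # make/summarize a first candidate model m.01:
RT ~ 1 + FREQPMW, # RT ~ an overall intercept (1) & FREQPMW
data=x, # those variables are in x
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
(deviance(m.00) - # the null model's deviance minus
deviance(m.01)) / # the candidate model's deviance, divided by
deviance(m.00) # the null model's deviance: multiple R-squared
```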