Let's load a data set for a few linear modeling examples:

```
rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
```

```
## CASE RT LENGTH LANGUAGE GROUP
## Min. : 1 Min. : 271.0 Min. :3.000 english:4023 english :3961
## 1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spanish:3977 heritage:4039
## Median :4310 Median : 595.0 Median :5.000
## Mean :4303 Mean : 661.5 Mean :5.198
## 3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
## Max. :8610 Max. :4130.0 Max. :9.000
## NA's :248
## CORRECT FREQPMW ZIPFFREQ SITE
## correct :7749 Min. : 1.00 Min. :3.000 a:2403
## incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
## Median : 42.00 Median :4.623 c:2782
## Mean : 81.14 Mean :4.591
## 3rd Qu.: 101.00 3rd Qu.:5.004
## Max. :1152.00 Max. :6.061
##
```

Here's what the columns contain:

- the column `CASE` is a simple case/row number so that each case has a unique ID;
- the column `RT` is the response variable: the reaction time in ms that indicates how long a speaker needed to decide whether the word displayed on a computer screen was a real word or not;
- the column `LENGTH` is a numeric predictor: the length of the stimulus word displayed on the computer screen in characters; usually, longer words need longer reaction times;
- the column `LANGUAGE` is a binary predictor: the language of the stimulus word displayed on the computer screen (*english* vs. *spanish*);
- the column `GROUP` is a binary predictor: whether the speaker to whom the stimulus word was presented was a learner of Spanish (*english*) or a heritage speaker of Spanish (*heritage*);
- the column `CORRECT` is a binary predictor or control: whether the speaker correctly classified the stimulus word displayed on the computer screen as a word or a non-word (*correct* vs. *incorrect*);
- the column `FREQPMW` is a numeric predictor or control: the frequency of the stimulus word per million words in a large reference corpus;
- the column `ZIPFFREQ` is a numeric predictor or control: the transformed frequency of the stimulus word in a large reference corpus (the transformation included a log to the base of 10);
- the column `SITE` is a ternary predictor or control: at which of three locations the speaker participated in the experiment (*a* vs. *b* vs. *c*).

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a 'proper' multifactorial modeling approach.

A useful thing to consider is **deviance**, which we
will here define as the overall amount of variability in the response
variable. As we discussed in class, the response variable has
a mean, but its values are not all equal to that mean; they
vary around it:

`summary(d$RT)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 271.0 505.0 595.0 661.5 732.0 4130.0 248
```

One way to express the degree to which the values of `RT` vary around the mean would be the average absolute deviation:

`mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)`

`## [1] 172.5416`

But as you know, this is not how we usually quantify deviations from the mean: as with the standard deviation, we square the differences between the observed values and the mean and then add them up to get the so-called deviance:

`sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)`

`## [1] 536471586`
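As a side note, this deviance is simply (n-1) times the sample variance. A minimal sketch with a small synthetic vector (hypothetical values, not the RT data) confirms the identity:

```r
set.seed(1)                          # hypothetical data, just to illustrate
y <- rnorm(60, mean=600, sd=250)     # fake 'reaction times'
dev.y <- sum((y - mean(y))^2)        # the deviance of y ...
all.equal(dev.y,                     # ... is the same as
   var(y) * (length(y)-1))           # (n-1) times the sample variance: TRUE
```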

This number quantifies how much all data points together deviate from
the mean. Another useful way to look at this would be to already use a
linear model to quantify this variability of the values around the mean.
We can do this by fitting a so-called **null model**,
i.e. a model with no predictors:

```
summary(m.00 <- lm( # make/summarize the linear model m.00:
RT ~ 1, # RT ~ an overall intercept (1) & no predictor
data=d, # those variables are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
```

```
##
## Call:
## lm(formula = RT ~ 1, data = d, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.5 -156.5 -66.5 70.5 3468.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.469 2.988 221.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.1 on 7751 degrees of freedom
## (248 observations deleted due to missingness)
```

That model's intercept is the overall mean of the response variable, and that model's deviance is the same as what we computed manually just above:

`deviance(m.00)`

`## [1] 536471586`
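This equivalence is not specific to our data: an intercept-only model always recovers the sample mean as its intercept and the summed squared deviations as its deviance. A quick sketch with synthetic numbers (hypothetical, not the RT values):

```r
set.seed(2)                                # hypothetical data, just to illustrate
y <- rnorm(100, mean=600, sd=250)          # fake response values
m <- lm(y ~ 1)                             # intercept-only null model
all.equal(unname(coef(m)[1]), mean(y))     # intercept == sample mean: TRUE
all.equal(deviance(m),                     # model deviance == summed
   sum((y - mean(y))^2))                   # squared deviations: TRUE
```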

Note that this null model actually returns the same result as when
you use a one-sample *t*-test on the reaction times; compare the
mean and the *t*-value:

`t.test(d$RT)`

```
##
## One Sample t-test
##
## data: d$RT
## t = 221.37, df = 7751, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 655.6111 667.3259
## sample estimates:
## mean of x
## 661.4685
```
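Again, this holds in general, not just for our data: the *t*-value of the null model's intercept is the one-sample *t*-test statistic. A sketch with made-up data:

```r
set.seed(3)                                     # hypothetical data, just to illustrate
y <- rnorm(50, mean=600, sd=250)                # fake response values
t.lm <- summary(lm(y ~ 1))$coefficients[1, 3]   # t-value of the intercept
t.tt <- unname(t.test(y)$statistic)             # one-sample t-test statistic
all.equal(t.lm, t.tt)                           # TRUE
```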

And this is how we'd get the confidence intervals that `t.test` provides from the regression model (using `Confint` from the car package):

`Confint(m.00)`

```
## Estimate 2.5 % 97.5 %
## (Intercept) 661.4685 655.6111 667.3259
```

And the manual computation would be this:

```
# lower bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.025, df=7751) # the t-value for 2.5% at that df
```

```
## (Intercept)
## 655.6111
```

```
# upper bound:
coef(m.00)[1] + # the intercept plus
summary(m.00)$coef[1,2] * # the intercept's se
qt(0.975, df=7751) # the t-value for 97.5% at that df
```

```
## (Intercept)
## 667.3259
```
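The manual computation can also be checked against base R's `confint` (which, for a simple intercept-only lm, returns the same values as car's `Confint`), again on synthetic data rather than the RT set:

```r
set.seed(4)                           # hypothetical data, just to illustrate
y <- rnorm(80, mean=600, sd=250)      # fake response values
m <- lm(y ~ 1)                        # intercept-only null model
se <- summary(m)$coef[1, 2]           # the intercept's standard error
manual <- coef(m)[1] +                # intercept plus se times the
   se * qt(c(0.025, 0.975),           # t-values for 2.5% & 97.5%
      df=df.residual(m))              # at the model's residual df
all.equal(unname(manual),             # same as confint's
   unname(confint(m)[1, ]))           # lower & upper bounds: TRUE
```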

Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …

Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?

Some exploration of the relevant variables:

```
# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")
```

Very long tail on the right … Let's see where 'the middle 90%' of the data are, which might be useful as a good plotting range:

`quantile(d$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)`

```
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
## 271.00 417.00 446.00 469.65 487.00 505.00 522.00 538.00 556.00 576.00
## 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
## 595.00 614.00 640.00 665.00 695.00 732.00 777.00 834.00 929.00 1137.45
## 100%
## 4130.00
```

Ok, so the middle 90% of the data are in [417, 1137.45], which means
that values like those might make for good *y*-axis limits
(e.g. for some plots). Now the predictor:

`hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")`

```
# the predictor(s) w/ the response
plot(
main="RT in ms as a function of frequency per million words",
pch=16, col="#00000030",
xlab="Frequency pmw", xlim=c( 0, 1200), x=d$FREQPMW,
ylab="RT in ms" , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)
```