1 Intro

Let’s load a data set for a few linear modeling examples:

rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(d <- read.delim(   # summarize d, the result of loading
   file="inputfiles/202_02-03_RTs.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
##       CASE            RT             LENGTH         LANGUAGE         GROUP     
##  Min.   :   1   Min.   : 271.0   Min.   :3.000   english:4023   english :3961  
##  1st Qu.:2150   1st Qu.: 505.0   1st Qu.:4.000   spanish:3977   heritage:4039  
##  Median :4310   Median : 595.0   Median :5.000                                 
##  Mean   :4303   Mean   : 661.5   Mean   :5.198                                 
##  3rd Qu.:6450   3rd Qu.: 732.0   3rd Qu.:6.000                                 
##  Max.   :8610   Max.   :4130.0   Max.   :9.000                                 
##                 NA's   :248                                                    
##       CORRECT        FREQPMW           ZIPFFREQ     SITE    
##  correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403  
##  incorrect: 251   1st Qu.:  17.00   1st Qu.:4.230   b:2815  
##                   Median :  42.00   Median :4.623   c:2782  
##                   Mean   :  81.14   Mean   :4.591           
##                   3rd Qu.: 101.00   3rd Qu.:5.004           
##                   Max.   :1152.00   Max.   :6.061           
## 

Here’s what the columns are/contain:

  • the column CASE is a simple case/row number just so that each case has a unique ID;
  • the column RT is the response variable: it’s the reaction time in ms that indicates how long a speaker needed to decide whether the word displayed on a computer screen was a real word or not;
  • the column LENGTH is a numeric predictor: it’s the length of the stimulus word displayed on the computer screen in characters; usually, longer words need longer reaction times;
  • the column LANGUAGE is a binary predictor: it’s the language of the stimulus word displayed on the computer screen (english vs. spanish);
  • the column GROUP is a binary predictor: it’s whether the speaker to whom the stimulus word was presented was a learner of Spanish (english) or a heritage speaker of Spanish (heritage);
  • the column CORRECT is a binary predictor or control: it’s whether the speaker correctly classified the stimulus word displayed on the computer screen as a word or a non-word (correct vs. incorrect);
  • the column FREQPMW is a numeric predictor or control: it’s the frequency of the stimulus word per million words in a large reference corpus;
  • the column ZIPFFREQ is a numeric predictor or control: it’s the transformed frequency of the stimulus word in a large reference corpus (the transformation included a log to the base of 10; see the quick check after this list);
  • the column SITE is a ternary predictor or control: it encodes at which of three locations speakers participated in the experiment (a vs. b vs. c).
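
Given these descriptions, ZIPFFREQ is presumably a deterministic transformation of FREQPMW. Here’s a quick sanity check; this is a sketch only, assuming ZIPFFREQ is the common Zipf scale, i.e. the log (base 10) of the frequency per billion words, which would make it log10(FREQPMW)+3:

head(cbind(                        # show the first few rows of:
   ZIPFFREQ=d$ZIPFFREQ,            # the Zipf frequencies next to
   log10fpmw3=log10(d$FREQPMW)+3)) # the assumed transformation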

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.

2 Deviance

A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we discussed in class, the response variable has a mean, but its values are not all equal to that mean: they vary around it:

summary(d$RT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   271.0   505.0   595.0   661.5   732.0  4130.0     248

One way to express the degree to which the values of RT vary around the mean would be the average absolute deviation:

mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
## [1] 172.5416

But as you know, this is not how we usually quantify deviations from the mean: as with the standard deviation, we square the differences between the observed values and the mean and then add them up to get the so-called deviance:

sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
## [1] 536471586
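
As a quick cross-check (a sketch, not part of the original computation): the sample variance is this deviance divided by n-1, so multiplying the variance of RT by the number of non-missing values minus 1 should return the same number:

var(d$RT, na.rm=TRUE) *   # the variance of RT multiplied by
   (sum(!is.na(d$RT))-1)  # the number of non-missing RTs minus 1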

This number quantifies how much all data points together deviate from the mean. Another useful way to arrive at this value is to use a linear model to quantify the variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:

summary(m.00 <- lm(       # make/summarize the linear model m.00:
   RT ~ 1,                # RT ~ an overall intercept (1) & no predictor
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)
## 
## Call:
## lm(formula = RT ~ 1, data = d, na.action = na.exclude)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -390.5 -156.5  -66.5   70.5 3468.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  661.469      2.988   221.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263.1 on 7751 degrees of freedom
##   (248 observations deleted due to missingness)

That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:

deviance(m.00)
## [1] 536471586
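
Equivalently (a small cross-check sketch), the deviance of a linear model is the sum of its squared residuals:

sum(residuals(m.00)^2, # sum the squared residuals of m.00
   na.rm=TRUE)         # (na.exclude padded them w/ NAs)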

Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:

t.test(d$RT)
## 
##  One Sample t-test
## 
## data:  d$RT
## t = 221.37, df = 7751, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  655.6111 667.3259
## sample estimates:
## mean of x 
##  661.4685
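
That the t-values match is no accident: in both outputs, the t-value is just the estimated mean/intercept divided by its standard error, as this cross-check sketch shows:

summary(m.00)$coef[1,1] /  # the intercept estimate divided by
   summary(m.00)$coef[1,2] # its standard error = the t-value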

And this is how we’d get the same confidence interval from the regression model:

Confint(m.00)
##             Estimate    2.5 %   97.5 %
## (Intercept) 661.4685 655.6111 667.3259

And the manual computation would be this:

# lower bound:
coef(m.00)[1] +              # the intercept plus
   summary(m.00)$coef[1,2] * # the intercept's se
   qt(0.025, df=7751)        # the t-value for 2.5% at that df
## (Intercept) 
##    655.6111
# upper bound:
coef(m.00)[1] +              # the intercept plus
   summary(m.00)$coef[1,2] * # the intercept's se
   qt(0.975, df=7751)        # the t-value for 97.5% at that df
## (Intercept) 
##    667.3259
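
Base R’s confint should return the same interval here as car’s Confint; a quick cross-check sketch:

confint(m.00) # base-R counterpart of Confint(m.00)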

Since our above deviance value quantifies the overall amount of variability of the response variable and, at the same time, expresses how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
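
As a preview of that logic (a sketch only; the corresponding model is fitted and discussed properly below), we could already compare the null model’s deviance to that of a model with, say, FREQPMW as a predictor:

deviance(lm(              # the deviance of a linear model of
   RT ~ FREQPMW,          # RT as a function of FREQPMW
   data=d,                # those variables are in d
   na.action=na.exclude)) # hopefully, this is < deviance(m.00)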

3 A numeric predictor

Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?

3.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")

Very long tail on the right … Let’s see where ‘the middle 90%’ of the data are, which might be useful as a plotting range:

quantile(d$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)
##      0%      5%     10%     15%     20%     25%     30%     35%     40%     45% 
##  271.00  417.00  446.00  469.65  487.00  505.00  522.00  538.00  556.00  576.00 
##     50%     55%     60%     65%     70%     75%     80%     85%     90%     95% 
##  595.00  614.00  640.00  665.00  695.00  732.00  777.00  834.00  929.00 1137.45 
##    100% 
## 4130.00

Ok, so the middle 90% of the data fall within [417, 1137.45], which means that values around that range might make for good y-axis limits (e.g. for some plots). Now the predictor:

hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")
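
Word frequencies are typically extremely right-skewed, which is why they are often inspected on a log scale as well (a sketch; this is essentially the information ZIPFFREQ already encodes):

hist(log10(d$FREQPMW),          # histogram of the logged frequencies
   xlab="log10(frequency pmw)") # w/ a more informative x-axis label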

# the predictor(s) w/ the response
plot(
   main="RT in ms as a function of frequency per million words",
   pch=16, col="#00000030",
   xlab="Frequency pmw", xlim=c(  0, 1200), x=d$FREQPMW,
   ylab="RT in ms"     , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)
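
To eyeball whether a straight line is even a good summary of this trend, one might add a non-parametric smoother on top (a sketch; lowess does not handle NAs, hence the complete-cases filter, and compl is just an ad hoc helper variable):

compl <- complete.cases(d$RT, d$FREQPMW) # cases w/o missing data
lines(lowess(         # add a lowess smoother through
   d$FREQPMW[compl],  # the non-missing frequency values &
   d$RT[compl]),      # the non-missing RT values
   col="blue", lwd=2) # in blue, somewhat thicker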