1 Intro: A multifactorial model selection process

We are dealing with the same data set as last session, but now in a multifactorial way. Specifically, we are asking whether the reaction time to a word (in ms) varies as a function of

  • the Zipf frequency of that word (ZIPFFREQ);
  • the language that word was presented in (LANGUAGE: english vs. spanish);
  • the speaker group that word was presented to (GROUP: english vs. heritage);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors?
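
In R's formula notation, all of the terms just listed can be requested at once with the * operator. As a minimal sketch (the formula below is only our assumption of how such a full model could be written down; nothing is fitted here), terms() shows what the formula expands to:

```r
# hypothetical formula for the full model described in the list above
f <- RT ~ ZIPFFREQ * LANGUAGE * GROUP
# extract the labels of all terms that the * operator expands to
term.labs <- attr(terms(f), "term.labels")
term.labs         # main effects, pairwise interactions, three-way interaction
length(term.labs) # how many terms there are in total
```

The expansion contains the three main effects, the three pairwise interactions, and the one three-way interaction, which is why a single formula is enough to cover every question in the list.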
rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(d <- read.delim(   # summarize d, the result of loading
   file="inputfiles/202_02-03_RTs.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
##       CASE            RT             LENGTH         LANGUAGE         GROUP
##  Min.   :   1   Min.   : 271.0   Min.   :3.000   english:4023   english :3961
##  1st Qu.:2150   1st Qu.: 505.0   1st Qu.:4.000   spanish:3977   heritage:4039
##  Median :4310   Median : 595.0   Median :5.000
##  Mean   :4303   Mean   : 661.5   Mean   :5.198
##  3rd Qu.:6450   3rd Qu.: 732.0   3rd Qu.:6.000
##  Max.   :8610   Max.   :4130.0   Max.   :9.000
##                 NA's   :248
##       CORRECT        FREQPMW           ZIPFFREQ     SITE
##  correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
##  incorrect: 251   1st Qu.:  17.00   1st Qu.:4.230   b:2815
##                   Median :  42.00   Median :4.623   c:2782
##                   Mean   :  81.14   Mean   :4.591
##                   3rd Qu.: 101.00   3rd Qu.:5.004
##                   Max.   :1152.00   Max.   :6.061
## 

1.1 Exploration & preparation

Since there are some missing data, but only in the response variable (248 NAs in RT), we immediately reduce d to the complete cases that a regression would consider anyway:

summary(d <- droplevels(d[complete.cases(d),])) # note the droplevels!
##       CASE            RT             LENGTH       LANGUAGE         GROUP
##  Min.   :   1   Min.   : 271.0   Min.   :3.0   english:3977   english :3793
##  1st Qu.:2123   1st Qu.: 505.0   1st Qu.:4.0   spanish:3775   heritage:3959
##  Median :4262   Median : 595.0   Median :5.0
##  Mean   :4268   Mean   : 661.5   Mean   :5.2
##  3rd Qu.:6405   3rd Qu.: 732.0   3rd Qu.:6.0
##  Max.   :8610   Max.   :4130.0   Max.   :9.0
##       CORRECT        FREQPMW           ZIPFFREQ     SITE
##  correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
##  incorrect:   3   1st Qu.:  18.00   1st Qu.:4.255   b:2815
##                   Median :  43.00   Median :4.633   c:2534
##                   Mean   :  82.88   Mean   :4.609
##                   3rd Qu.: 104.00   3rd Qu.:5.017
##                   Max.   :1152.00   Max.   :6.061
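
Why "note the droplevels"? Subsetting a data frame keeps all factor levels, even those that no longer occur in any row. The following sketch illustrates this with a small made-up data frame (toy and kept are hypothetical names and values, not part of the RT data):

```r
# toy data frame with hypothetical values (not the RT data)
toy <- data.frame(
   RT=c(500, NA, 620, NA),                # two missing responses
   SITE=factor(c("a", "b", "a", "c")))    # sites b & c only occur w/ NAs
kept <- toy[complete.cases(toy),]         # keep only rows without any NA
nrow(kept)                                # number of rows that survive
levels(kept$SITE)                         # unused levels are still listed
table(kept$SITE)                          # ... and show up w/ counts of 0
kept <- droplevels(kept)                  # discard the unused levels
levels(kept$SITE)                         # only the attested level remains
```

Without droplevels, such empty levels would still appear in tables and plots, which is why we apply it right after subsetting.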

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")

hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")
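
The argument breaks="FD" asks hist to suggest the number of bins with the Freedman-Diaconis rule, whose bin width is 2 times the interquartile range divided by the cube root of the sample size. Here is a rough sketch of that computation on simulated values (x is made up and only loosely resembles reaction times; it is not the RT data):

```r
set.seed(1)                            # reproducible simulated data
x <- rnorm(1000, mean=600, sd=150)     # hypothetical values, not d$RT
# Freedman-Diaconis bin width: 2 * IQR / cube root of n
fd.width <- 2 * IQR(x) / length(x)^(1/3)
# the number of bins that this width implies over the range of x
fd.bins <- ceiling(diff(range(x)) / fd.width)
# compare w/ R's own helper function (which rounds x slightly first)
c(manual=fd.bins, nclass.FD=nclass.FD(x))
# hist treats this number as a suggestion & picks 'pretty' breakpoints
h <- hist(x, breaks="FD", plot=FALSE)
length(h$counts)                       # the number of bins actually used
```

Because hist only treats the computed number as a suggestion, the number of bins in the plot can differ somewhat from the manual result.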

table(d$LANGUAGE)
##
## english spanish
##    3977    3775
table(d$GROUP)
##
##  english heritage
##     3793     3959
# the predictor(s) w/ the response
plot(
   main="RT in ms as a function of Zipf frequency",
   pch=16, col="#00000030",
   xlab="Zipf frequency", xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT (in ms)"    , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$ZIPFFREQ), col="green", lwd=2)
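
A note on col="#00000030" in the plot call: this is an 8-digit hex color whose last two digits set the opacity, so overlapping points add up to darker areas. As a small sketch of how to decode or construct such a color (pt.col and pt.col.2 are hypothetical names used only for this illustration):

```r
pt.col <- "#00000030"                  # the color used in the plot above
# decode it into red, green, blue, & alpha (each on a scale from 0 to 255)
(channels <- col2rgb(pt.col, alpha=TRUE))
channels["alpha", 1] / 255             # opacity as a proportion
# build the same color from its components
(pt.col.2 <- rgb(0, 0, 0, 48, maxColorValue=255))
```

So the points are black (red, green, and blue all 0) with an alpha value of 48 out of 255, i.e. a bit less than 20% opaque. This makes dense regions of the scatterplot stand out even though thousands of points are plotted on top of each other.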