We are dealing with the same data set as last session, but now in a multifactorial way. Specifically, we are asking: does the reaction time to a word (in ms) vary as a function of
ZIPFFREQ;
LANGUAGE: english vs. spanish;
GROUP: english vs. heritage?
rm(list=ls(all.names=TRUE))
library(car); library(effects)
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
##      CASE             RT            LENGTH         LANGUAGE         GROUP
##  Min.   :   1   Min.   : 271.0   Min.   :3.000   english:4023   english :3961
##  1st Qu.:2150   1st Qu.: 505.0   1st Qu.:4.000   spanish:3977   heritage:4039
##  Median :4310   Median : 595.0   Median :5.000
##  Mean   :4303   Mean   : 661.5   Mean   :5.198
##  3rd Qu.:6450   3rd Qu.: 732.0   3rd Qu.:6.000
##  Max.   :8610   Max.   :4130.0   Max.   :9.000
##                 NA's   :248
##     CORRECT           FREQPMW         ZIPFFREQ       SITE
##  correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
##  incorrect: 251   1st Qu.:  17.00   1st Qu.:4.230   b:2815
##                   Median :  42.00   Median :4.623   c:2782
##                   Mean   :  81.14   Mean   :4.591
##                   3rd Qu.: 101.00   3rd Qu.:5.004
##                   Max.   :1152.00   Max.   :6.061
##
Since there are some missing data, but only in the response variable, we immediately reduce d to all the complete cases a regression would consider:
summary(d <- droplevels(d[complete.cases(d),])) # note the droplevels!
##      CASE             RT           LENGTH        LANGUAGE         GROUP
##  Min.   :   1   Min.   : 271.0   Min.   :3.0   english:3977   english :3793
##  1st Qu.:2123   1st Qu.: 505.0   1st Qu.:4.0   spanish:3775   heritage:3959
##  Median :4262   Median : 595.0   Median :5.0
##  Mean   :4268   Mean   : 661.5   Mean   :5.2
##  3rd Qu.:6405   3rd Qu.: 732.0   3rd Qu.:6.0
##  Max.   :8610   Max.   :4130.0   Max.   :9.0
##     CORRECT           FREQPMW         ZIPFFREQ       SITE
##  correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
##  incorrect:   3   1st Qu.:  18.00   1st Qu.:4.255   b:2815
##                   Median :  43.00   Median :4.633   c:2534
##                   Mean   :  82.88   Mean   :4.609
##                   3rd Qu.: 104.00   3rd Qu.:5.017
##                   Max.   :1152.00   Max.   :6.061
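Why the droplevels? In the summary above every factor level survives the subsetting, so it changes nothing here, but it is a cheap precaution: if removing the incomplete cases had emptied a level completely, that level would still be carried along with a count of 0. A minimal sketch with made-up toy data (the data frame toy and its values are hypothetical, not taken from the RT file):

```r
# hypothetical toy data: the only "incorrect" row is also the row w/ the missing RT
toy <- data.frame(
   RT=c(500, NA, 620, 710),
   CORRECT=factor(c("correct", "incorrect", "correct", "correct")))
complete.cases(toy)                   # FALSE for the row with the NA
toy.sub <- toy[complete.cases(toy),]  # keep only the complete rows
levels(toy.sub$CORRECT)               # "incorrect" is still listed as a level
table(toy.sub$CORRECT)                # ... but with a frequency of 0
toy.sub <- droplevels(toy.sub)        # discard unused factor levels
levels(toy.sub$CORRECT)               # only "correct" is left
```

Without droplevels, such empty levels can show up later as empty cells in tables and plots.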
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")
table(d$LANGUAGE)
##
## english spanish
## 3977 3775
table(d$GROUP)
##
## english heritage
## 3793 3959
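A note on breaks="FD" in the histograms above: it asks hist to choose the number of bins with the Freedman-Diaconis rule, whose bin width is 2*IQR/n^(1/3), instead of the default Sturges rule. The following is a rough sketch on simulated data (the vector x is hypothetical, not the RTs from d); nclass.FD may differ minimally from the hand computation because of internal rounding, and hist treats the resulting number only as a suggestion when it picks 'pretty' break points.

```r
set.seed(1)                                # make the simulated data reproducible
x <- rlnorm(1000, meanlog=6.4, sdlog=0.3)  # hypothetical right-skewed 'RTs'
h <- 2 * IQR(x) / length(x)^(1/3)          # Freedman-Diaconis bin width
ceiling(diff(range(x)) / h)                # number of bins that width implies
nclass.FD(x)                               # R's own version of that computation
nclass.Sturges(x)                          # number of bins under the default rule
```

With skewed data and a large n, the FD rule usually yields more and narrower bins, which shows the shape of the long right tail in more detail.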
# the predictor(s) w/ the response
plot(
main="RT in ms as a function of Zipf frequency",
pch=16, col="#00000030",
xlab="Zipf frequency", xlim=c( 3, 6), x=d$ZIPFFREQ,
ylab="RT (in ms)" , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$ZIPFFREQ), col="green", lwd=2)
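ZIPFFREQ and FREQPMW appear to encode the same information on different scales. The text does not spell out the conversion, so the following is an assumption: the usual definition of the Zipf scale is log10 of the frequency per million words plus 3, and the summary values above are consistent with it. A small sketch (pmw2zipf is a hypothetical helper, not a function used elsewhere in these notes):

```r
# assumed conversion (not stated in the text): Zipf = log10(freq. per million) + 3
pmw2zipf <- function (pmw) { log10(pmw) + 3 } # hypothetical helper function
# min, median, and max of FREQPMW from the first summary of d
zipf.check <- pmw2zipf(c(1, 42, 1152))
round(zipf.check, 3) # compare with min, median, max of ZIPFFREQ: 3.000, 4.623, 6.061
```

If that assumption holds, plotting RT against ZIPFFREQ amounts to plotting it against logged frequency, which is why the strongly right-skewed FREQPMW values are spread out much more evenly along the x-axis.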