In session 03, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of
ZIPFFREQ
);LANGUAGE
:
english vs. spanish);GROUP
:
english vs. heritage);rm(list=ls(all.names=TRUE)); library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
str(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame': 8000 obs. of 9 variables:
## $ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
## $ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
## $ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
## $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
## $ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
## $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
## $ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
## $ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
## $ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
summary(m.01 <- lm( # make/summarize the linear model m.01:
RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP, # RT ~ these predictors & their interactions
data=d, na.action=na.exclude)) # those vars are in d, skip NAs
##
## Call:
## lm(formula = RT ~ 1 + ZIPFFREQ * LANGUAGE * GROUP, data = d,
## na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -489.7 -141.2 -54.5 67.4 3388.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 713.44 51.32 13.902 < 2e-16 ***
## ZIPFFREQ -30.84 11.00 -2.803 0.005081 **
## LANGUAGEspanish 432.15 69.57 6.212 5.5e-10 ***
## GROUPheritage 129.83 72.64 1.787 0.073922 .
## ZIPFFREQ:LANGUAGEspanish -51.65 14.97 -3.450 0.000563 ***
## ZIPFFREQ:GROUPheritage -17.39 15.56 -1.117 0.263824
## LANGUAGEspanish:GROUPheritage -306.90 97.24 -3.156 0.001604 **
## ZIPFFREQ:LANGUAGEspanish:GROUPheritage 40.88 20.94 1.953 0.050891 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 250.4 on 7744 degrees of freedom
## (248 observations deleted due to missingness)
## Multiple R-squared: 0.09514, Adjusted R-squared: 0.09432
## F-statistic: 116.3 on 7 and 7744 DF, p-value: < 2.2e-16
Did we have enough data points for what we did there?
How many data points did we model? Careful, it’s not
nrow(d)
=8000 – it’s
nrow(model.frame(m.01))
=7752, namely
nrow(d)
=8000 minus the missing data points! With that, we
ask ourselves
summary(m.01)$fstatistic[2]
, which amounts
to 7, which is of course the length of the model’s coefficient output
minus 1: length(coef(m.01))-1
=7.Since p (7) is less than either m/10=775.2 or m/20=387.6, we’re fine.
In session 05, we fit a binary logistic regression model to answer the question whether the choice of a genitive construction (of vs. s) varies as a function of
POR_ANIMACY
:
animate vs. collective vs. inanimate
vs. locative vs. temporal);rm(list=ls(all.names=TRUE))
str(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_04-05_GENs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame': 3600 obs. of 9 variables:
## $ CASE : int 2 3 4 5 6 7 8 9 10 11 ...
## $ GENITIVE : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
## $ SPEAKER : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
## $ MODALITY : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
## $ POR_LENGTH : int 13 22 11 26 8 7 6 5 5 5 ...
## $ PUM_LENGTH : int 7 7 8 4 4 3 36 3 5 5 ...
## $ POR_ANIMACY : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
## $ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
## $ POR_DEF : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)
summary(m.01 <- glm( # make/summarize the gen. linear model m.01:
GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # GENITIVE ~ these predictors
data=d, family=binomial, na.action=na.exclude)) # vars are in d, resp = binary, skip NAs
##
## Call:
## glm(formula = GENITIVE ~ 1 + LEN_PORmPUM_LOG * POR_ANIMACY *
## SPEAKER, family = binomial, data = d, na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.65632 0.09135 7.184
## LEN_PORmPUM_LOG -0.70601 0.08154 -8.659
## POR_ANIMACYcollective -1.77888 0.16150 -11.015
## POR_ANIMACYinanimate -4.32698 0.21686 -19.953
## POR_ANIMACYlocative -2.30616 0.27227 -8.470
## POR_ANIMACYtemporal -1.32138 0.22536 -5.863
## SPEAKERns -0.08278 0.17204 -0.481
## LEN_PORmPUM_LOG:POR_ANIMACYcollective -0.46680 0.15054 -3.101
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.32099 0.17729 1.811
## LEN_PORmPUM_LOG:POR_ANIMACYlocative -0.29249 0.25885 -1.130
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.44522 0.18375 2.423
## LEN_PORmPUM_LOG:SPEAKERns 0.14245 0.16001 0.890
## POR_ANIMACYcollective:SPEAKERns 0.79493 0.29353 2.708
## POR_ANIMACYinanimate:SPEAKERns -0.28854 0.51213 -0.563
## POR_ANIMACYlocative:SPEAKERns -0.50011 0.49785 -1.005
## POR_ANIMACYtemporal:SPEAKERns 0.16161 0.41275 0.392
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns -0.14283 0.28943 -0.493
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns -0.65372 0.43153 -1.515
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns -0.19485 0.50679 -0.384
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns -0.53061 0.38091 -1.393
## Pr(>|z|)
## (Intercept) 6.75e-13 ***
## LEN_PORmPUM_LOG < 2e-16 ***
## POR_ANIMACYcollective < 2e-16 ***
## POR_ANIMACYinanimate < 2e-16 ***
## POR_ANIMACYlocative < 2e-16 ***
## POR_ANIMACYtemporal 4.54e-09 ***
## SPEAKERns 0.63039
## LEN_PORmPUM_LOG:POR_ANIMACYcollective 0.00193 **
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.07022 .
## LEN_PORmPUM_LOG:POR_ANIMACYlocative 0.25849
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.01539 *
## LEN_PORmPUM_LOG:SPEAKERns 0.37331
## POR_ANIMACYcollective:SPEAKERns 0.00677 **
## POR_ANIMACYinanimate:SPEAKERns 0.57315
## POR_ANIMACYlocative:SPEAKERns 0.31511
## POR_ANIMACYtemporal:SPEAKERns 0.69538
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns 0.62166
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns 0.12980
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns 0.70062
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns 0.16362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4004.3 on 3599 degrees of freedom
## Residual deviance: 2404.3 on 3580 degrees of freedom
## AIC: 2444.3
##
## Number of Fisher Scoring iterations: 7
Did we have enough data points for what we did there?
How many data points did we model? This time it is
nrow(d)
=3600 and of course also
nrow(model.frame(m.01))
=3600. With that, we ask
ourselves
min(table(d$GENITIVE))
and, thus, our critical
value is between 88 and 44.m.01$df.null-m.01$df.residual
, which amounts
to 19, which is of course the length of the model’s coefficient output
minus 1: length(coef(m.01))-1
=19.Since p (19) is less than either m/10=88 or m/20=44, we’re fine.
In session 07, we fit an ordinal logistic regression model to answer the question(s) whether the rating of a Spanish construction involving the semantic role of an experiencer varies as a function of
AGREEMENT
: no vs. yes);NEGATION
:
affirmative vs. negative);EMPHPRON
: no vs. yes);SUBJORDER
: gramsem vs. semgram);NUMPERGRAM
: plural vs. singular);PERSONGRAM
: first vs. second
vs. third);AGREEMENT
.rm(list=ls(all.names=TRUE)); library(MASS)
str(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_07_RATG.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame': 4300 obs. of 8 variables:
## $ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
## $ JUDGMENT : int 3 -3 3 3 -3 2 -2 2 -3 3 ...
## $ AGREEMENT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
## $ NEGATION : Factor w/ 2 levels "affirmative",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EMPHPRON : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ SUBJORDER : Factor w/ 2 levels "gramsem","semgram": 1 1 1 1 1 1 1 1 2 2 ...
## $ NUMBERGRAM: Factor w/ 2 levels "plural","singular": 1 1 1 1 1 1 1 1 1 1 ...
## $ PERSONGRAM: Factor w/ 3 levels "first","second",..: 3 3 3 3 3 3 3 3 3 3 ...
d$JUDGMENT <- factor(d$JUDGMENT, ordered=TRUE)
summary(m.01 <- polr(JUDGMENT ~ 1 + # summarize m.01, which models JUDGMENT ~
# AGREEMENT interacting w/ all other predictors:
AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
Hess=TRUE, data=d, na.action=na.exclude))# recommended, vars are in d, skip NAs
## Call:
## polr(formula = JUDGMENT ~ 1 + AGREEMENT * (NEGATION + EMPHPRON +
## SUBJORDER + NUMBERGRAM + PERSONGRAM), data = d, na.action = na.exclude,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## AGREEMENTyes 1.00762 0.23047 4.3720
## NEGATIONnegative -0.07963 0.07825 -1.0177
## EMPHPRONyes 0.01820 0.07869 0.2313
## SUBJORDERsemgram 0.26155 0.07833 3.3390
## NUMBERGRAMsingular -0.28824 0.12844 -2.2441
## PERSONGRAMsecond 0.12173 0.10676 1.1402
## PERSONGRAMthird 0.41046 0.16241 2.5273
## AGREEMENTyes:NEGATIONnegative 0.06444 0.10980 0.5869
## AGREEMENTyes:EMPHPRONyes -0.14410 0.11020 -1.3077
## AGREEMENTyes:SUBJORDERsemgram 0.39130 0.11081 3.5311
## AGREEMENTyes:NUMBERGRAMsingular 0.24572 0.18523 1.3266
## AGREEMENTyes:PERSONGRAMsecond -0.05795 0.15453 -0.3750
## AGREEMENTyes:PERSONGRAMthird -0.36910 0.23059 -1.6007
##
## Intercepts:
## Value Std. Error t value
## -3|-2 -1.2230 0.1653 -7.4002
## -2|-1 -0.4605 0.1635 -2.8162
## -1|0 0.0101 0.1632 0.0617
## 0|1 0.3675 0.1634 2.2493
## 1|2 0.7353 0.1637 4.4917
## 2|3 1.6316 0.1652 9.8743
##
## Residual Deviance: 15099.88
## AIC: 15137.88
Did we have enough data points for what we did there?
How many data points did we model? It again is
nrow(d)
=4300 and of course also
nrow(model.frame(m.01))
=4300; for polr
objects, we could also get this from m.01$n
, which also
returns 4300). With that, we ask ourselves
m.01$n - (1/m.01$n^2) * sum(table(d$JUDGMENT)^3)
=4123.5
and, thus, our critical value is between
m/10=412.4 and
m/20=206.2.m.01$edf
(edf for ‘effective
degrees of freedom’) and we get 19.Since p (19) is less than either m/10=412.4 or m/20=206.2, we’re fine.
In session 02, we fit a linear model to answer the question whether
the reaction time to a word (in ms) varies as a function of the
frequency of that word (normalized to per million words,
FREQPMW
). We first re-load the data and refit the model,
…
rm(list=ls(all.names=TRUE)); library(MASS)
str(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame': 8000 obs. of 9 variables:
## $ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
## $ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
## $ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
## $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
## $ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
## $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
## $ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
## $ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
## $ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
summary(m.01 <- lm( # make/summarize the linear model m.01:
RT ~ 1 + FREQPMW, # RT ~ this predictor
data=d, na.action=na.exclude)) # vars are in d, skip NAs
##
## Call:
## lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -396.3 -156.1 -68.9 72.4 3468.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 673.83892 3.65267 184.479 < 2e-16 ***
## FREQPMW -0.14926 0.02546 -5.863 4.73e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.5 on 7750 degrees of freedom
## (248 observations deleted due to missingness)
## Multiple R-squared: 0.004416, Adjusted R-squared: 0.004287
## F-statistic: 34.37 on 1 and 7750 DF, p-value: 4.734e-09
… and now we generate three model-diagnostic plots for this model to see whether this model was a ‘good one’:
par(mfrow=c(1, 3)) # make the plotting window have 1 rows & 3 columns
plot(m.01, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m.01), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default