# 1 Model assumptions & diagnostics

## 1.1 Amount of data

### 1.1.1 Linear regression

In session 03, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of

• the Zipf frequency of that word (`ZIPFFREQ`);
• the language that word was presented in (`LANGUAGE`: english vs.Â spanish);
• the speaker group that words was presented to (`GROUP`: english vs.Â heritage);
• any pairwise interaction of these predictors;
• the three-way interaction of these predictors.
``rm(list=ls(all.names=TRUE)); library(effects)``
``## Loading required package: carData``
``````## lattice theme set by effectsTheme()
## See ?effectsTheme for details.``````
``````str(d <- read.delim(       # summarize d, the result of loading
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors``````
``````## 'data.frame':    8000 obs. of  9 variables:
##  \$ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
##  \$ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
##  \$ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
##  \$ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
##  \$ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
##  \$ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...``````
``````summary(m.01 <- lm(          # make/summarize the linear model m.01:
RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP, # RT ~ these predictors & their interactions
data=d, na.action=na.exclude))    # those vars are in d, skip NAs``````
``````##
## Call:
## lm(formula = RT ~ 1 + ZIPFFREQ * LANGUAGE * GROUP, data = d,
##     na.action = na.exclude)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -489.7 -141.2  -54.5   67.4 3388.8
##
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              713.44      51.32  13.902  < 2e-16 ***
## ZIPFFREQ                                 -30.84      11.00  -2.803 0.005081 **
## LANGUAGEspanish                          432.15      69.57   6.212  5.5e-10 ***
## GROUPheritage                            129.83      72.64   1.787 0.073922 .
## ZIPFFREQ:LANGUAGEspanish                 -51.65      14.97  -3.450 0.000563 ***
## ZIPFFREQ:GROUPheritage                   -17.39      15.56  -1.117 0.263824
## LANGUAGEspanish:GROUPheritage           -306.90      97.24  -3.156 0.001604 **
## ZIPFFREQ:LANGUAGEspanish:GROUPheritage    40.88      20.94   1.953 0.050891 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 250.4 on 7744 degrees of freedom
##   (248 observations deleted due to missingness)
## Multiple R-squared:  0.09514,    Adjusted R-squared:  0.09432
## F-statistic: 116.3 on 7 and 7744 DF,  p-value: < 2.2e-16``````

Did we have enough data points for what we did there?

How many data points did we model? Careful, itâ€™s not `nrow(d)`=8000 â€“ itâ€™s `nrow(model.frame(m.01))`=7752, namely `nrow(d)`=8000 minus the missing data points! With that, we ask ourselves

• what is the limiting sample size? Generally speaking, m is the sample size and, thus, our critical value is between 775.2 and 387.6.
• what is the number of predictors/coefficients we tried to estimate from the data p in this case? The most direct way of getting that number is `summary(m.01)\$fstatistic[2]`, which amounts to 7, which is of course the length of the modelâ€™s coefficient output minus 1: `length(coef(m.01))-1`=7.

Since p (7) is less than either m/10=775.2 or m/20=387.6, weâ€™re fine.

### 1.1.2 Binary logistic regression

In session 05, we fit a binary logistic regression model to answer the question whether the choice of a genitive construction (of vs.Â s) varies as a function of

• a length-based short-before-long effect, but, this time, if we hypothesize a short-before-long effect, maybe we should not just be looking at the length of the possessor, but how the length of the possessor compares to the length of the possessum;
• the language that word was presented in (`POR_ANIMACY`: animate vs.Â collective vs.Â inanimate vs.Â locative vs.Â temporal);
• whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs.Â ns);
• any pairwise interaction of these predictors;
• the three-way interaction of these predictors.
``````rm(list=ls(all.names=TRUE))
file="inputfiles/202_04-05_GENs.csv", # this file
stringsAsFactors=TRUE))    # change categorical variables into factors``````
``````## 'data.frame':    3600 obs. of  9 variables:
##  \$ CASE         : int  2 3 4 5 6 7 8 9 10 11 ...
##  \$ GENITIVE     : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
##  \$ SPEAKER      : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ MODALITY     : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ POR_LENGTH   : int  13 22 11 26 8 7 6 5 5 5 ...
##  \$ PUM_LENGTH   : int  7 7 8 4 4 3 36 3 5 5 ...
##  \$ POR_ANIMACY  : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
##  \$ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
##  \$ POR_DEF      : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...``````
``````d\$LEN_PORmPUM_LOG <- log2(d\$POR_LENGTH)-log2(d\$PUM_LENGTH)
summary(m.01 <- glm( # make/summarize the gen. linear model m.01:
GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # GENITIVE ~ these predictors
data=d, family=binomial, na.action=na.exclude)) # vars are in d, resp = binary, skip NAs``````
``````##
## Call:
## glm(formula = GENITIVE ~ 1 + LEN_PORmPUM_LOG * POR_ANIMACY *
##     SPEAKER, family = binomial, data = d, na.action = na.exclude)
##
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                      0.65632    0.09135   7.184
## LEN_PORmPUM_LOG                                 -0.70601    0.08154  -8.659
## POR_ANIMACYcollective                           -1.77888    0.16150 -11.015
## POR_ANIMACYinanimate                            -4.32698    0.21686 -19.953
## POR_ANIMACYlocative                             -2.30616    0.27227  -8.470
## POR_ANIMACYtemporal                             -1.32138    0.22536  -5.863
## SPEAKERns                                       -0.08278    0.17204  -0.481
## LEN_PORmPUM_LOG:POR_ANIMACYcollective           -0.46680    0.15054  -3.101
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.32099    0.17729   1.811
## LEN_PORmPUM_LOG:POR_ANIMACYlocative             -0.29249    0.25885  -1.130
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.44522    0.18375   2.423
## LEN_PORmPUM_LOG:SPEAKERns                        0.14245    0.16001   0.890
## POR_ANIMACYcollective:SPEAKERns                  0.79493    0.29353   2.708
## POR_ANIMACYinanimate:SPEAKERns                  -0.28854    0.51213  -0.563
## POR_ANIMACYlocative:SPEAKERns                   -0.50011    0.49785  -1.005
## POR_ANIMACYtemporal:SPEAKERns                    0.16161    0.41275   0.392
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns -0.14283    0.28943  -0.493
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns  -0.65372    0.43153  -1.515
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns   -0.19485    0.50679  -0.384
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns   -0.53061    0.38091  -1.393
##                                                 Pr(>|z|)
## (Intercept)                                     6.75e-13 ***
## LEN_PORmPUM_LOG                                  < 2e-16 ***
## POR_ANIMACYcollective                            < 2e-16 ***
## POR_ANIMACYinanimate                             < 2e-16 ***
## POR_ANIMACYlocative                              < 2e-16 ***
## POR_ANIMACYtemporal                             4.54e-09 ***
## SPEAKERns                                        0.63039
## LEN_PORmPUM_LOG:POR_ANIMACYcollective            0.00193 **
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.07022 .
## LEN_PORmPUM_LOG:POR_ANIMACYlocative              0.25849
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.01539 *
## LEN_PORmPUM_LOG:SPEAKERns                        0.37331
## POR_ANIMACYcollective:SPEAKERns                  0.00677 **
## POR_ANIMACYinanimate:SPEAKERns                   0.57315
## POR_ANIMACYlocative:SPEAKERns                    0.31511
## POR_ANIMACYtemporal:SPEAKERns                    0.69538
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns  0.62166
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns   0.12980
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns    0.70062
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns    0.16362
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 2404.3  on 3580  degrees of freedom
## AIC: 2444.3
##
## Number of Fisher Scoring iterations: 7``````

Did we have enough data points for what we did there?

How many data points did we model? This time it is `nrow(d)`=3600 and of course also `nrow(model.frame(m.01))`=3600. With that, we ask ourselves

• what is the limiting sample size? Generally speaking, m is computed as `min(table(d\$GENITIVE))` and, thus, our critical value is between 88 and 44.
• what is the number of predictors/coefficients we tried to estimate from the data p in this case? The most direct way of getting that number is `m.01\$df.null-m.01\$df.residual`, which amounts to 19, which is of course the length of the modelâ€™s coefficient output minus 1: `length(coef(m.01))-1`=19.

Since p (19) is less than either m/10=88 or m/20=44, weâ€™re fine.

### 1.1.3 Ordinal logistic regression

In session 07, we fit an ordinal logistic regression model to answer the question(s) whether the rating of a Spanish construction involving the semantic role of an experiencer varies as a function of

• whether there is the right subject-verb agreement (`AGREEMENT`: no vs.Â yes);
• whether the clause is negated (`NEGATION`: affirmative vs.Â negative);
• whether the clause contains an emphatic pronoun (`EMPHPRON`: no vs.Â yes);
• what the order of the semantic and the grammatical subject is (`SUBJORDER`: gramsem vs.Â semgram);
• what the number of the grammatical subject is (`NUMPERGRAM`: plural vs.Â singular);
• what the person of the grammatical subject is (`PERSONGRAM`: first vs.Â second vs.Â third);
• any interaction of a predictor with `AGREEMENT`.
``````rm(list=ls(all.names=TRUE)); library(MASS)
file="inputfiles/202_07_RATG.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors``````
``````## 'data.frame':    4300 obs. of  8 variables:
##  \$ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
##  \$ JUDGMENT  : int  3 -3 3 3 -3 2 -2 2 -3 3 ...
##  \$ AGREEMENT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
##  \$ NEGATION  : Factor w/ 2 levels "affirmative",..: 1 1 1 1 1 1 1 1 1 1 ...
##  \$ EMPHPRON  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ SUBJORDER : Factor w/ 2 levels "gramsem","semgram": 1 1 1 1 1 1 1 1 2 2 ...
##  \$ NUMBERGRAM: Factor w/ 2 levels "plural","singular": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ PERSONGRAM: Factor w/ 3 levels "first","second",..: 3 3 3 3 3 3 3 3 3 3 ...``````
``````d\$JUDGMENT <- factor(d\$JUDGMENT, ordered=TRUE)
summary(m.01 <- polr(JUDGMENT ~ 1 + # summarize m.01, which models JUDGMENT ~
# AGREEMENT interacting w/ all other predictors:
AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
Hess=TRUE, data=d, na.action=na.exclude))# recommended, vars are in d, skip NAs``````
``````## Call:
## polr(formula = JUDGMENT ~ 1 + AGREEMENT * (NEGATION + EMPHPRON +
##     SUBJORDER + NUMBERGRAM + PERSONGRAM), data = d, na.action = na.exclude,
##     Hess = TRUE)
##
## Coefficients:
##                                    Value Std. Error t value
## AGREEMENTyes                     1.00762    0.23047  4.3720
## NEGATIONnegative                -0.07963    0.07825 -1.0177
## EMPHPRONyes                      0.01820    0.07869  0.2313
## SUBJORDERsemgram                 0.26155    0.07833  3.3390
## NUMBERGRAMsingular              -0.28824    0.12844 -2.2441
## PERSONGRAMsecond                 0.12173    0.10676  1.1402
## PERSONGRAMthird                  0.41046    0.16241  2.5273
## AGREEMENTyes:NEGATIONnegative    0.06444    0.10980  0.5869
## AGREEMENTyes:EMPHPRONyes        -0.14410    0.11020 -1.3077
## AGREEMENTyes:SUBJORDERsemgram    0.39130    0.11081  3.5311
## AGREEMENTyes:NUMBERGRAMsingular  0.24572    0.18523  1.3266
## AGREEMENTyes:PERSONGRAMsecond   -0.05795    0.15453 -0.3750
## AGREEMENTyes:PERSONGRAMthird    -0.36910    0.23059 -1.6007
##
## Intercepts:
##       Value   Std. Error t value
## -3|-2 -1.2230  0.1653    -7.4002
## -2|-1 -0.4605  0.1635    -2.8162
## -1|0   0.0101  0.1632     0.0617
## 0|1    0.3675  0.1634     2.2493
## 1|2    0.7353  0.1637     4.4917
## 2|3    1.6316  0.1652     9.8743
##
## Residual Deviance: 15099.88
## AIC: 15137.88``````

Did we have enough data points for what we did there?

How many data points did we model? It again is `nrow(d)`=4300 and of course also `nrow(model.frame(m.01))`=4300; for `polr` objects, we could also get this from `m.01\$n`, which also returns 4300). With that, we ask ourselves

• what is the limiting sample size? Generally speaking, m is computed as `m.01\$n - (1/m.01\$n^2) * sum(table(d\$JUDGMENT)^3)`=4123.5 and, thus, our critical value is between m/10=412.4 and m/20=206.2.
• what is the number of predictors/coefficients we tried to estimate from the data p in this case? We can get that directly from the model object with `m.01\$edf` (edf for â€˜effective degrees of freedomâ€™) and we get 19.

Since p (19) is less than either m/10=412.4 or m/20=206.2, weâ€™re fine.

## 1.2 Residuals (of a linear model)

### 1.2.1 Example 1

In session 02, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words, `FREQPMW`). We first re-load the data and refit the model, â€¦

``````rm(list=ls(all.names=TRUE)); library(MASS)
file="inputfiles/202_02-03_RTs.csv", # this file
stringsAsFactors=TRUE))   # change categorical variables into factors``````
``````## 'data.frame':    8000 obs. of  9 variables:
##  \$ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
##  \$ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
##  \$ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
##  \$ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
##  \$ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
##  \$ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...``````
``````summary(m.01 <- lm(               # make/summarize the linear model m.01:
RT ~ 1 + FREQPMW,              # RT ~ this predictor
data=d, na.action=na.exclude)) # vars are in d, skip NAs``````
``````##
## Call:
## lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -396.3 -156.1  -68.9   72.4 3468.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 673.83892    3.65267 184.479  < 2e-16 ***
## FREQPMW      -0.14926    0.02546  -5.863 4.73e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.5 on 7750 degrees of freedom
##   (248 observations deleted due to missingness)
## Multiple R-squared:  0.004416,   Adjusted R-squared:  0.004287
## F-statistic: 34.37 on 1 and 7750 DF,  p-value: 4.734e-09``````

â€¦ and now we generate three model-diagnostic plots for this model to see whether this model was a â€˜good oneâ€™:

``````par(mfrow=c(1, 3))    # make the plotting window have 1 rows & 3 columns
plot(m.01, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m.01), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))    # reset to default``````