1 Model assumptions & diagnostics

1.1 Amount of data

1.1.1 Linear regression

In session 03, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of

  • the Zipf frequency of that word (ZIPFFREQ);
  • the language that word was presented in (LANGUAGE: english vs. spanish);
  • the speaker group that words was presented to (GROUP: english vs. heritage);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors.
rm(list=ls(all.names=TRUE)); library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
str(d <- read.delim(       # summarize d, the result of loading
   file="inputfiles/202_02-03_RTs.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame':    8000 obs. of  9 variables:
##  $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
##  $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
##  $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
##  $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
##  $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
summary(m.01 <- lm(          # make/summarize the linear model m.01:
   RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP, # RT ~ these predictors & their interactions
   data=d, na.action=na.exclude))    # those vars are in d, skip NAs
##
## Call:
## lm(formula = RT ~ 1 + ZIPFFREQ * LANGUAGE * GROUP, data = d,
##     na.action = na.exclude)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -489.7 -141.2  -54.5   67.4 3388.8
##
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              713.44      51.32  13.902  < 2e-16 ***
## ZIPFFREQ                                 -30.84      11.00  -2.803 0.005081 **
## LANGUAGEspanish                          432.15      69.57   6.212  5.5e-10 ***
## GROUPheritage                            129.83      72.64   1.787 0.073922 .
## ZIPFFREQ:LANGUAGEspanish                 -51.65      14.97  -3.450 0.000563 ***
## ZIPFFREQ:GROUPheritage                   -17.39      15.56  -1.117 0.263824
## LANGUAGEspanish:GROUPheritage           -306.90      97.24  -3.156 0.001604 **
## ZIPFFREQ:LANGUAGEspanish:GROUPheritage    40.88      20.94   1.953 0.050891 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 250.4 on 7744 degrees of freedom
##   (248 observations deleted due to missingness)
## Multiple R-squared:  0.09514,    Adjusted R-squared:  0.09432
## F-statistic: 116.3 on 7 and 7744 DF,  p-value: < 2.2e-16

Did we have enough data points for what we did there?

How many data points did we model? Careful, it’s not nrow(d)=8000 – it’s nrow(model.frame(m.01))=7752, namely nrow(d)=8000 minus the missing data points! With that, we ask ourselves

  • what is the limiting sample size? Generally speaking, m is the sample size and, thus, our critical value is between 775.2 and 387.6.
  • what is the number of predictors/coefficients we tried to estimate from the data p in this case? The most direct way of getting that number is summary(m.01)$fstatistic[2], which amounts to 7, which is of course the length of the model’s coefficient output minus 1: length(coef(m.01))-1=7.

Since p (7) is less than either m/10=775.2 or m/20=387.6, we’re fine.

1.1.2 Binary logistic regression

In session 05, we fit a binary logistic regression model to answer the question whether the choice of a genitive construction (of vs. s) varies as a function of

  • a length-based short-before-long effect, but, this time, if we hypothesize a short-before-long effect, maybe we should not just be looking at the length of the possessor, but how the length of the possessor compares to the length of the possessum;
  • the language that word was presented in (POR_ANIMACY: animate vs. collective vs. inanimate vs. locative vs. temporal);
  • whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs. ns);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors.
rm(list=ls(all.names=TRUE))
str(d <- read.delim(          # summarize d, the result of loading
   file="inputfiles/202_04-05_GENs.csv", # this file
   stringsAsFactors=TRUE))    # change categorical variables into factors
## 'data.frame':    3600 obs. of  9 variables:
##  $ CASE         : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ GENITIVE     : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
##  $ SPEAKER      : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MODALITY     : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
##  $ POR_LENGTH   : int  13 22 11 26 8 7 6 5 5 5 ...
##  $ PUM_LENGTH   : int  7 7 8 4 4 3 36 3 5 5 ...
##  $ POR_ANIMACY  : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
##  $ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
##  $ POR_DEF      : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)
summary(m.01 <- glm( # make/summarize the gen. linear model m.01:
   GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # GENITIVE ~ these predictors
   data=d, family=binomial, na.action=na.exclude)) # vars are in d, resp = binary, skip NAs
##
## Call:
## glm(formula = GENITIVE ~ 1 + LEN_PORmPUM_LOG * POR_ANIMACY *
##     SPEAKER, family = binomial, data = d, na.action = na.exclude)
##
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                      0.65632    0.09135   7.184
## LEN_PORmPUM_LOG                                 -0.70601    0.08154  -8.659
## POR_ANIMACYcollective                           -1.77888    0.16150 -11.015
## POR_ANIMACYinanimate                            -4.32698    0.21686 -19.953
## POR_ANIMACYlocative                             -2.30616    0.27227  -8.470
## POR_ANIMACYtemporal                             -1.32138    0.22536  -5.863
## SPEAKERns                                       -0.08278    0.17204  -0.481
## LEN_PORmPUM_LOG:POR_ANIMACYcollective           -0.46680    0.15054  -3.101
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.32099    0.17729   1.811
## LEN_PORmPUM_LOG:POR_ANIMACYlocative             -0.29249    0.25885  -1.130
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.44522    0.18375   2.423
## LEN_PORmPUM_LOG:SPEAKERns                        0.14245    0.16001   0.890
## POR_ANIMACYcollective:SPEAKERns                  0.79493    0.29353   2.708
## POR_ANIMACYinanimate:SPEAKERns                  -0.28854    0.51213  -0.563
## POR_ANIMACYlocative:SPEAKERns                   -0.50011    0.49785  -1.005
## POR_ANIMACYtemporal:SPEAKERns                    0.16161    0.41275   0.392
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns -0.14283    0.28943  -0.493
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns  -0.65372    0.43153  -1.515
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns   -0.19485    0.50679  -0.384
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns   -0.53061    0.38091  -1.393
##                                                 Pr(>|z|)
## (Intercept)                                     6.75e-13 ***
## LEN_PORmPUM_LOG                                  < 2e-16 ***
## POR_ANIMACYcollective                            < 2e-16 ***
## POR_ANIMACYinanimate                             < 2e-16 ***
## POR_ANIMACYlocative                              < 2e-16 ***
## POR_ANIMACYtemporal                             4.54e-09 ***
## SPEAKERns                                        0.63039
## LEN_PORmPUM_LOG:POR_ANIMACYcollective            0.00193 **
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.07022 .
## LEN_PORmPUM_LOG:POR_ANIMACYlocative              0.25849
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.01539 *
## LEN_PORmPUM_LOG:SPEAKERns                        0.37331
## POR_ANIMACYcollective:SPEAKERns                  0.00677 **
## POR_ANIMACYinanimate:SPEAKERns                   0.57315
## POR_ANIMACYlocative:SPEAKERns                    0.31511
## POR_ANIMACYtemporal:SPEAKERns                    0.69538
## LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns  0.62166
## LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns   0.12980
## LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns    0.70062
## LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns    0.16362
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 2404.3  on 3580  degrees of freedom
## AIC: 2444.3
##
## Number of Fisher Scoring iterations: 7

Did we have enough data points for what we did there?

How many data points did we model? This time it is nrow(d)=3600 and of course also nrow(model.frame(m.01))=3600. With that, we ask ourselves

  • what is the limiting sample size? Generally speaking, m is computed as min(table(d$GENITIVE)) and, thus, our critical value is between 88 and 44.
  • what is the number of predictors/coefficients we tried to estimate from the data p in this case? The most direct way of getting that number is m.01$df.null-m.01$df.residual, which amounts to 19, which is of course the length of the model’s coefficient output minus 1: length(coef(m.01))-1=19.

Since p (19) is less than either m/10=88 or m/20=44, we’re fine.

1.1.3 Ordinal logistic regression

In session 07, we fit an ordinal logistic regression model to answer the question(s) whether the rating of a Spanish construction involving the semantic role of an experiencer varies as a function of

  • whether there is the right subject-verb agreement (AGREEMENT: no vs. yes);
  • whether the clause is negated (NEGATION: affirmative vs. negative);
  • whether the clause contains an emphatic pronoun (EMPHPRON: no vs. yes);
  • what the order of the semantic and the grammatical subject is (SUBJORDER: gramsem vs. semgram);
  • what the number of the grammatical subject is (NUMPERGRAM: plural vs. singular);
  • what the person of the grammatical subject is (PERSONGRAM: first vs. second vs. third);
  • any interaction of a predictor with AGREEMENT.
rm(list=ls(all.names=TRUE)); library(MASS)
str(d <- read.delim(   # summarize d, the result of loading
   file="inputfiles/202_07_RATG.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
## 'data.frame':    4300 obs. of  8 variables:
##  $ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ JUDGMENT  : int  3 -3 3 3 -3 2 -2 2 -3 3 ...
##  $ AGREEMENT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
##  $ NEGATION  : Factor w/ 2 levels "affirmative",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EMPHPRON  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ SUBJORDER : Factor w/ 2 levels "gramsem","semgram": 1 1 1 1 1 1 1 1 2 2 ...
##  $ NUMBERGRAM: Factor w/ 2 levels "plural","singular": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PERSONGRAM: Factor w/ 3 levels "first","second",..: 3 3 3 3 3 3 3 3 3 3 ...
d$JUDGMENT <- factor(d$JUDGMENT, ordered=TRUE)
summary(m.01 <- polr(JUDGMENT ~ 1 + # summarize m.01, which models JUDGMENT ~
   # AGREEMENT interacting w/ all other predictors:
   AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
   Hess=TRUE, data=d, na.action=na.exclude))# recommended, vars are in d, skip NAs
## Call:
## polr(formula = JUDGMENT ~ 1 + AGREEMENT * (NEGATION + EMPHPRON +
##     SUBJORDER + NUMBERGRAM + PERSONGRAM), data = d, na.action = na.exclude,
##     Hess = TRUE)
##
## Coefficients:
##                                    Value Std. Error t value
## AGREEMENTyes                     1.00762    0.23047  4.3720
## NEGATIONnegative                -0.07963    0.07825 -1.0177
## EMPHPRONyes                      0.01820    0.07869  0.2313
## SUBJORDERsemgram                 0.26155    0.07833  3.3390
## NUMBERGRAMsingular              -0.28824    0.12844 -2.2441
## PERSONGRAMsecond                 0.12173    0.10676  1.1402
## PERSONGRAMthird                  0.41046    0.16241  2.5273
## AGREEMENTyes:NEGATIONnegative    0.06444    0.10980  0.5869
## AGREEMENTyes:EMPHPRONyes        -0.14410    0.11020 -1.3077
## AGREEMENTyes:SUBJORDERsemgram    0.39130    0.11081  3.5311
## AGREEMENTyes:NUMBERGRAMsingular  0.24572    0.18523  1.3266
## AGREEMENTyes:PERSONGRAMsecond   -0.05795    0.15453 -0.3750
## AGREEMENTyes:PERSONGRAMthird    -0.36910    0.23059 -1.6007
##
## Intercepts:
##       Value   Std. Error t value
## -3|-2 -1.2230  0.1653    -7.4002
## -2|-1 -0.4605  0.1635    -2.8162
## -1|0   0.0101  0.1632     0.0617
## 0|1    0.3675  0.1634     2.2493
## 1|2    0.7353  0.1637     4.4917
## 2|3    1.6316  0.1652     9.8743
##
## Residual Deviance: 15099.88
## AIC: 15137.88

Did we have enough data points for what we did there?

How many data points did we model? It again is nrow(d)=4300 and of course also nrow(model.frame(m.01))=4300; for polr objects, we could also get this from m.01$n, which also returns 4300). With that, we ask ourselves

  • what is the limiting sample size? Generally speaking, m is computed as m.01$n - (1/m.01$n^2) * sum(table(d$JUDGMENT)^3)=4123.5 and, thus, our critical value is between m/10=412.4 and m/20=206.2.
  • what is the number of predictors/coefficients we tried to estimate from the data p in this case? We can get that directly from the model object with m.01$edf (edf for ‘effective degrees of freedom’) and we get 19.

Since p (19) is less than either m/10=412.4 or m/20=206.2, we’re fine.

1.2 Residuals (of a linear model)

1.2.1 Example 1

In session 02, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words, FREQPMW). We first re-load the data and refit the model, …

rm(list=ls(all.names=TRUE)); library(MASS)
str(d <- read.delim(         # summarize d, the result of loading
   file="inputfiles/202_02-03_RTs.csv", # this file
   stringsAsFactors=TRUE))   # change categorical variables into factors
## 'data.frame':    8000 obs. of  9 variables:
##  $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
##  $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
##  $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
##  $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
##  $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
summary(m.01 <- lm(               # make/summarize the linear model m.01:
   RT ~ 1 + FREQPMW,              # RT ~ this predictor
   data=d, na.action=na.exclude)) # vars are in d, skip NAs
##
## Call:
## lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -396.3 -156.1  -68.9   72.4 3468.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 673.83892    3.65267 184.479  < 2e-16 ***
## FREQPMW      -0.14926    0.02546  -5.863 4.73e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.5 on 7750 degrees of freedom
##   (248 observations deleted due to missingness)
## Multiple R-squared:  0.004416,   Adjusted R-squared:  0.004287
## F-statistic: 34.37 on 1 and 7750 DF,  p-value: 4.734e-09

… and now we generate three model-diagnostic plots for this model to see whether this model was a ‘good one’:

par(mfrow=c(1, 3))    # make the plotting window have 1 rows & 3 columns
plot(m.01, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m.01), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))    # reset to default