On proper regression modeling in varieties research

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

26 May 2024 12-34-56

Abstract

One particularly prominent methodological development in linguistics is what has been termed the “quantitative turn”: Not only are more and more studies using statistical tools to explore data and to test hypotheses, the complexity of the statistical methods employed is growing as well. This development is particularly prominent in all kinds of corpus-linguistic studies: 20 years ago chi-squared tests, t-tests, and Pearson’s r reigned supreme, but now more and more corpus studies are using multivariate exploratory tools and, for hypothesis testing, multifactorial predictive modeling techniques, in particular regression models (and, increasingly, tree-based methods). However welcome this development is, it, and especially its pace as well as the fact that few places offer rigorous training in statistical methods, comes with its own risks, chief among them that analytical methods are misapplied, which can lead imprecise, incomplete, or wrong analyses. In this paper, I will revisit a recent regression-analytic study in the research area of English varieties (Parviainen & Fuchs 2018 on clause-final also and only in three Asian Englishes) to:

rm(list=ls(all=TRUE)); invisible(gc())
pkgs2load <- c("car", "effects", "emmans", "lme4", "MASS")
invisible(sapply(pkgs2load, library, character.only=TRUE,
   logical.return=TRUE, quietly=TRUE, verbose=FALSE, warn.conflicts=TRUE)); rm(pkgs2load)

1 Parviainen & Fuchs (2018): what is explored and how

Their analyses are based on the private-conversation parts of the studied components, for which relevant metadata for the speakers are available. They retrieved all instances of clause-final also/only from these corpus parts and put together a data frame shown here as Table 1 (their Appendix 1 with two cosmetic changes and my bolding to be explained below):

Table 1: P&F’s data as per their appendix 1
CASE PARTICLE VARIETY GENDER AGE SPKRS TOKENS WORDS TMWpaper
C001 also HKE female 14–25 84 45 107796 417
C002 also HKE female 26–35 3 0 5304 0
C003 also HKE female 36–50 6 0 5888 0
C004 also HKE female >50 0 0 0 0
C005 also HKE male 14–25 15 5 16599 301
C006 also HKE male 26–35 2 0 2892 0
C007 also HKE male 36–50 4 0 3204 0
C008 also HKE male >50 1 0 902 0
C009 also IndE female 14–25 52 74 43410 1705
C010 also IndE female 26–35 33 33 29456 1120
C011 also IndE female 36–50 27 40 25282 1582
C012 also IndE female >50 9 11 6757 1628
C013 also IndE male 14–25 18 14 14236 983
C014 also IndE male 26–35 26 24 25082 957
C015 also IndE male 36–50 37 31 34771 892
C016 also IndE male >50 37 35 31927 1096
C017 also PhiE female 14–25 170 37 93197 397
C018 also PhiE female 26–35 78 9 25083 359
C019 also PhiE female 36–50 54 6 11674 514
C020 also PhiE female >50 0 2 3613 554
C021 also PhiE male 14–25 69 6 32386 185
C022 also PhiE male 26–35 75 5 17283 289
C023 also PhiE male 36–50 106 6 6190 969
C024 also PhiE male >50 1 0 3937 0
C025 only HKE female 14–25 84 29 107796 269
C026 only HKE female 26–35 3 2 5304 377
C027 only HKE female 36–50 6 0 5888 0
C028 only HKE female >50 0 0 0 0
C029 only HKE male 14–25 15 1 16599 60
C030 only HKE male 26–35 2 0 2892 0
C031 only HKE male 36–50 4 0 3204 0
C032 only HKE male >50 1 0 902 0
C033 only IndE female 14–25 52 41 43410 944
C034 only IndE female 26–35 33 20 29456 679
C035 only IndE female 36–50 27 17 25282 672
C036 only IndE female >50 9 0 6757 0
C037 only IndE male 14–25 18 17 14236 1194
C038 only IndE male 26–35 26 14 25082 558
C039 only IndE male 36–50 37 10 34771 288
C040 only IndE male >50 37 7 31927 219
C041 only PhiE female 14–25 170 12 93197 129
C042 only PhiE female 26–35 78 1 25083 40
C043 only PhiE female 36–50 54 1 11674 86
C044 only PhiE female >50 0 0 3613 0
C045 only PhiE male 14–25 69 0 32386 0
C046 only PhiE male 26–35 78 1 17283 58
C047 only PhiE male 36–50 106 0 6190 0
C048 only PhiE male >50 1 1 3937 254
d <- read.delim("01b_also-only-ex.csv", stringsAsFactors=TRUE)
d$TMW <- (d$TOKENS/d$WORDS) * 1E6; d$TMW[is.na(d$TMW)] <- 0
d$VARIETY <- factor(d$VARIETY, levels=levels(d$VARIETY)[c(2,1,3)])
d$AGE <- factor(d$AGE, levels=levels(d$AGE)[c(2:4, 1)])
dd <- split(d[,-2], d$PARTICLE, drop=TRUE)
analysis.for.also <- step(lm(TMW ~ AGE*GENDER*VARIETY, data=dd$also))
analysis.for.only <- step(lm(TMW ~ AGE*GENDER*VARIETY, data=dd$only))
summary(m.final.also <- stepAIC(
   lm(TMW ~ 1, data=dd$also),
   # bidirectional model selection ...
   direction="both",
   # between these two extremes: lower null & upper full model:
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
   trace=0)) # do not show all steps

Call:
lm(formula = TMW ~ VARIETY + GENDER + VARIETY:GENDER, data = dd$also)

Residuals:
    Min      1Q  Median      3Q     Max
-388.46  -98.86  -65.25  101.85  608.34

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              1508.8      117.4  12.856 1.65e-10 ***
VARIETYHKE              -1404.4      166.0  -8.462 1.09e-07 ***
VARIETYPhiE             -1052.9      166.0  -6.344 5.61e-06 ***
GENDERmale               -526.8      166.0  -3.174  0.00526 **
VARIETYHKE:GENDERmale     497.7      234.7   2.120  0.04813 *
VARIETYPhiE:GENDERmale    431.9      234.7   1.840  0.08232 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 234.7 on 18 degrees of freedom
Multiple R-squared:  0.8635,    Adjusted R-squared:  0.8256
F-statistic: 22.78 on 5 and 18 DF,  p-value: 3.33e-07
summary(m.final.only <- stepAIC(
   lm(TMW ~ 1, data=dd$only), direction="both",
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
   trace=0))

Call:
lm(formula = TMW ~ VARIETY + AGE + VARIETY:AGE, data = dd$only)

Residuals:
    Min      1Q  Median      3Q     Max
-192.41  -74.38    0.00   74.38  192.41

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            1069.3      106.8  10.011 3.54e-07 ***
VARIETYHKE             -904.7      151.1  -5.989 6.32e-05 ***
VARIETYPhiE           -1004.9      151.1  -6.653 2.35e-05 ***
AGE26–35               -450.7      151.1  -2.984 0.011401 *
AGE36–50               -589.3      151.1  -3.901 0.002105 **
AGE>50                 -959.7      151.1  -6.353 3.65e-05 ***
VARIETYHKE:AGE26–35     474.6      213.6   2.222 0.046281 *
VARIETYPhiE:AGE26–35    435.2      213.6   2.037 0.064288 .
VARIETYHKE:AGE36–50     424.7      213.6   1.988 0.070118 .
VARIETYPhiE:AGE36–50    567.8      213.6   2.658 0.020877 *
VARIETYHKE:AGE>50       795.1      213.6   3.722 0.002917 **
VARIETYPhiE:AGE>50     1022.3      213.6   4.786 0.000444 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 151.1 on 12 degrees of freedom
Multiple R-squared:  0.8935,    Adjusted R-squared:  0.7958
F-statistic: 9.149 on 11 and 12 DF,  p-value: 0.0003006

2 Problems of the hypotheses/operationalizations

2.1 Lack of precision

Figure 1: Four hypothetical results for AGE:GENDER:VARIETY

2.2 What do between-variety frequencies reveal?

3 Problems of the statistical modeling

3.1 A few minor mistakes

3.1.1 Errors in the data

   CASE PARTICLE VARIETY GENDER   AGE SPKRS TOKENS WORDS       TMW
22 C022     also    PhiE   male 26–35    75      5 17283 289.30163
46 C046     only    PhiE   male 26–35    78      1 17283  57.86033
   CASE PARTICLE VARIETY GENDER AGE SPKRS TOKENS WORDS      TMW
20 C020     also    PhiE female >50     0      2  3613 553.5566

3.1.2 Under-reporting and -analyzing

3.1.3 Over-reporting

3.2 Four fundamental issues

3.2.1 Issue 1: no level-1 predictors

3.2.1.1 General introduction

3.2.1.2 Application to P&F

3.2.2 Issue 2: two models on one data set

3.2.2.1 General introduction

model.AmE <- glm(CONSTRUCTION ~ LENGTHDO + LITERAL + DIRECTIONALPP, data=DIALECT=="AmE")
model.BrE <- glm(CONSTRUCTION ~ LENGTHDO + LITERAL + DIRECTIONALPP, data=DIALECT=="BrE")
model.bothE <- glm(CONSTRUCTION ~ DIALECT * (LENGTHDO + LITERAL + DIRECTIONALPP))

3.2.2.2 Application to P&F

summary(m.final.all <- stepAIC(
   # note the new data argument: all the data from both particles now
   lm(TMW ~ 1, data=d),
   # bidirectional model selection ...
   direction="both",
   # between these 2 extremes: lower null & the new upper/full model:
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY*PARTICLE),
   trace=0)) # do not show all steps

3.2.3 Issue 3: not weighting observations

3.2.3.1 General introduction

3.2.3.2 Application to P&F

dd$also[c(1, 3, 5),-8]
  CASE VARIETY GENDER   AGE SPKRS TOKENS  WORDS      TMW
1 C001     HKE female 14–25    84     45 107796 417.4552
3 C003     HKE female 36–50     6      0   5888   0.0000
5 C005     HKE   male 14–25    15      5  16599 301.2230
d.also.expanded <- as.data.frame( # make d.also.expanded a data frame of
   lapply(dd$also[,-5],           # applying to dd$also (w/out column 5)
   # an anonym. function that repeats each value in each column SPKRS times:
   \(af) rep(af, dd$also[,5])))

3.2.4 Issue 4: potentially wrong regression model

4 Re-analyzing their data

4.1 Methods

# this could be fitted on all level-1 uses of also/only, clause-final or not:
m.required <- glmer(
   CLAUSEFINAL ~                # binary response: no vs. yes
   1                          + # intercept
   PARTICLE*AGE*SEX*VARIETY   + # predictors & interactions
   # minimally necessary random-effects structure (more is likely needed)
   (1|SPEAKER),                 # speakers
   family=binomial,             # the response is binary
   data=unaggregated.dataframe) # a new unaggregated data of all also/only
# Here are the contrast-coding changes:
# expanding the data
d.expanded <- as.data.frame( # make d.expanded a data frame of
   lapply(d[,-6],            # applying to d (w/out column 6)
   # an anonym. function that repeats each value in each column SPKRS times:
   \(af) rep(af, d[,6])))
# making AGE 'more ordinal'
contrasts(d.expanded$AGE) <- contr.sdif
# planned contrasts for VARIETY
ind_vs_other <- c(2/3, -1/3, -1/3)
hke_vs_phi   <- c(0  ,  1/2, -1/2)
contrasts(d.expanded$VARIETY) <- cbind(ind_vs_other, hke_vs_phi)
summary(m.all.expanded <- lm(
   TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2,
   data=d.expanded))

Call:
lm(formula = TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2,
    data = d.expanded)

Residuals:
     Min       1Q   Median       3Q      Max
-167.840  -26.955    4.813   23.663  240.163

Coefficients:
                                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      622.757     11.746  53.019 < 2e-16 ***
PARTICLEonly                                    -366.174     16.611 -22.044 < 2e-16 ***
AGE26–35-14–25                                  -329.911     13.127 -25.132 < 2e-16 ***
AGE36–50-26–35                                   115.089     15.701   7.330 3.47e-13 ***
AGE>50-36–50                                    -125.960     44.433  -2.835 0.00464 **
GENDERmale                                      -140.878      9.435 -14.932 < 2e-16 ***
VARIETYind_vs_other                             1311.489     16.620  78.909 < 2e-16 ***
VARIETYhke_vs_phi                               -259.573     29.130  -8.911 < 2e-16 ***
AGE26–35-14–25:GENDERmale                        247.732     13.166  18.815 < 2e-16 ***
AGE36–50-26–35:GENDERmale                        207.769     13.876  14.973 < 2e-16 ***
AGE>50-36–50:GENDERmale                         -217.169     29.864  -7.272 5.28e-13 ***
AGE26–35-14–25:VARIETYind_vs_other              -163.289     21.384  -7.636 3.63e-14 ***
AGE36–50-26–35:VARIETYind_vs_other               -43.210     24.007  -1.800 0.07205 .
AGE>50-36–50:VARIETYind_vs_other                 583.758     54.798  10.653 < 2e-16 ***
AGE26–35-14–25:VARIETYhke_vs_phi                -359.560     34.136 -10.533  < 2e-16 ***
AGE36–50-26–35:VARIETYhke_vs_phi                -425.201     40.348 -10.538  < 2e-16 ***
AGE>50-36–50:VARIETYhke_vs_phi                   739.750    105.090   7.039 2.75e-12 ***
GENDERmale:VARIETYind_vs_other                  -590.630     14.789 -39.936 < 2e-16 ***
GENDERmale:VARIETYhke_vs_phi                       7.251     19.437   0.373 0.70917
PARTICLEonly:AGE26–35-14–25                      193.176     18.563  10.406 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35                     -177.554     22.203  -7.997 2.27e-15 ***
PARTICLEonly:AGE>50-36–50                       -175.959     62.837  -2.800 0.00516 **
PARTICLEonly:GENDERmale                           95.218     13.340   7.138 1.38e-12 ***
PARTICLEonly:VARIETYind_vs_other                -861.414     23.504 -36.649 < 2e-16 ***
PARTICLEonly:VARIETYhke_vs_phi                   319.783     41.195   7.763 1.39e-14 ***
PARTICLEonly:AGE26–35-14–25:GENDERmale          -216.984     18.583 -11.676 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:GENDERmale          -345.374     19.591 -17.630 < 2e-16 ***
PARTICLEonly:AGE>50-36–50:GENDERmale             616.900     42.232  14.607 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYind_vs_other -206.081     30.240  -6.815 1.29e-11 ***
PARTICLEonly:AGE36–50-26–35:VARIETYind_vs_other    1.377     33.948   0.041 0.96766
PARTICLEonly:AGE>50-36–50:VARIETYind_vs_other   -962.302     77.496 -12.417 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYhke_vs_phi    427.858     48.272   8.864 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:VARIETYhke_vs_phi    179.836     57.056   3.152 0.00165 **
PARTICLEonly:AGE>50-36–50:VARIETYhke_vs_phi     -847.091    148.619  -5.700 1.40e-08 ***
PARTICLEonly:GENDERmale:VARIETYind_vs_other      629.971     20.911  30.126 < 2e-16 ***
PARTICLEonly:GENDERmale:VARIETYhke_vs_phi       -135.348     27.487  -4.924 9.26e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 71.97 on 1781 degrees of freedom
Multiple R-squared:  0.9731,    Adjusted R-squared:  0.9725
F-statistic:  1838 on 35 and 1781 DF,  p-value: < 2.2e-16
drop1(m.all.expanded, test="F")
Single term deletions

Model:
TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2
                        Df Sum of Sq      RSS   AIC F value    Pr(>F)
<none>                                9224376 15575
PARTICLE:AGE:GENDER      3   5115524 14339900 16371 329.227 < 2.2e-16 ***
PARTICLE:AGE:VARIETY     6   2963271 12187648 16070  95.356 < 2.2e-16 ***
PARTICLE:GENDER:VARIETY  2   5551651 14776027 16428 535.944 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.2 Selected results

Figure 2: The interaction PARTICLE:GENDER:AGE
Figure 3: The interaction PARTICLE:GENDER:VARIETY

5 Concluding remarks

6 References

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] MASS_7.3-60.2  lme4_1.1-35.3  Matrix_1.7-0   effects_4.2-2  car_3.1-2
[6] carData_3.0-5  STGmisc_1.0    Rcpp_1.0.12    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] nlme_3.1-164       cli_3.6.2          knitr_1.46         rlang_1.1.3
 [5] xfun_0.44          estimability_1.5.1 DBI_1.2.2          mitools_2.4
 [9] survey_4.4-2       minqa_1.2.7        jsonlite_1.8.8     colorspace_2.1-0
[13] nnet_7.3-19        htmltools_0.5.8.1  rmarkdown_2.27     grid_4.4.0
[17] evaluate_0.23      abind_1.4-5        fastmap_1.2.0      yaml_2.3.8
[21] insight_0.19.11    htmlwidgets_1.6.4  rstudioapi_0.16.0  lattice_0.22-6
[25] digest_0.6.35      nloptr_2.0.3       splines_4.4.0      tools_4.4.0
[29] boot_1.3-30        survival_3.6-4