On proper regression modeling in varieties research
Abstract
One particularly prominent methodological development in linguistics is what has been termed the “quantitative turn”: Not only are more and more studies using statistical tools to explore data and to test hypotheses, the complexity of the statistical methods employed is growing as well. This development is particularly prominent in all kinds of corpus-linguistic studies: 20 years ago chi-squared tests, t-tests, and Pearson’s r reigned supreme, but now more and more corpus studies are using multivariate exploratory tools and, for hypothesis testing, multifactorial predictive modeling techniques, in particular regression models (and, increasingly, tree-based methods). However welcome this development is, it, and especially its pace as well as the fact that few places offer rigorous training in statistical methods, comes with its own risks, chief among them that analytical methods are misapplied, which can lead imprecise, incomplete, or wrong analyses. In this paper, I will revisit a recent regression-analytic study in the research area of English varieties (Parviainen & Fuchs 2018 on clause-final also and only in three Asian Englishes) to:
- highlight in particular three fundamental yet frequent mistakes that it exemplifies;
- discuss why and how each of these mistakes should be addressed;
- reanalyze the data (as far as is possible with what is available) and show briefly how that affects the analysis’s results and interpretation.
1 Parviainen & Fuchs (2018): what is explored and how
Their analyses are based on the private-conversation parts of the studied components, for which relevant metadata for the speakers are available. They retrieved all instances of clause-final also/only from these corpus parts and put together a data frame shown here as Table 1 (their Appendix 1 with two cosmetic changes and my bolding to be explained below):
CASE | PARTICLE | VARIETY | GENDER | AGE | SPKRS | TOKENS | WORDS | TMWpaper |
---|---|---|---|---|---|---|---|---|
C001 | also | HKE | female | 14–25 | 84 | 45 | 107796 | 417 |
C002 | also | HKE | female | 26–35 | 3 | 0 | 5304 | 0 |
C003 | also | HKE | female | 36–50 | 6 | 0 | 5888 | 0 |
C004 | also | HKE | female | >50 | 0 | 0 | 0 | 0 |
C005 | also | HKE | male | 14–25 | 15 | 5 | 16599 | 301 |
C006 | also | HKE | male | 26–35 | 2 | 0 | 2892 | 0 |
C007 | also | HKE | male | 36–50 | 4 | 0 | 3204 | 0 |
C008 | also | HKE | male | >50 | 1 | 0 | 902 | 0 |
C009 | also | IndE | female | 14–25 | 52 | 74 | 43410 | 1705 |
C010 | also | IndE | female | 26–35 | 33 | 33 | 29456 | 1120 |
C011 | also | IndE | female | 36–50 | 27 | 40 | 25282 | 1582 |
C012 | also | IndE | female | >50 | 9 | 11 | 6757 | 1628 |
C013 | also | IndE | male | 14–25 | 18 | 14 | 14236 | 983 |
C014 | also | IndE | male | 26–35 | 26 | 24 | 25082 | 957 |
C015 | also | IndE | male | 36–50 | 37 | 31 | 34771 | 892 |
C016 | also | IndE | male | >50 | 37 | 35 | 31927 | 1096 |
C017 | also | PhiE | female | 14–25 | 170 | 37 | 93197 | 397 |
C018 | also | PhiE | female | 26–35 | 78 | 9 | 25083 | 359 |
C019 | also | PhiE | female | 36–50 | 54 | 6 | 11674 | 514 |
C020 | also | PhiE | female | >50 | 0 | 2 | 3613 | 554 |
C021 | also | PhiE | male | 14–25 | 69 | 6 | 32386 | 185 |
C022 | also | PhiE | male | 26–35 | 75 | 5 | 17283 | 289 |
C023 | also | PhiE | male | 36–50 | 106 | 6 | 6190 | 969 |
C024 | also | PhiE | male | >50 | 1 | 0 | 3937 | 0 |
C025 | only | HKE | female | 14–25 | 84 | 29 | 107796 | 269 |
C026 | only | HKE | female | 26–35 | 3 | 2 | 5304 | 377 |
C027 | only | HKE | female | 36–50 | 6 | 0 | 5888 | 0 |
C028 | only | HKE | female | >50 | 0 | 0 | 0 | 0 |
C029 | only | HKE | male | 14–25 | 15 | 1 | 16599 | 60 |
C030 | only | HKE | male | 26–35 | 2 | 0 | 2892 | 0 |
C031 | only | HKE | male | 36–50 | 4 | 0 | 3204 | 0 |
C032 | only | HKE | male | >50 | 1 | 0 | 902 | 0 |
C033 | only | IndE | female | 14–25 | 52 | 41 | 43410 | 944 |
C034 | only | IndE | female | 26–35 | 33 | 20 | 29456 | 679 |
C035 | only | IndE | female | 36–50 | 27 | 17 | 25282 | 672 |
C036 | only | IndE | female | >50 | 9 | 0 | 6757 | 0 |
C037 | only | IndE | male | 14–25 | 18 | 17 | 14236 | 1194 |
C038 | only | IndE | male | 26–35 | 26 | 14 | 25082 | 558 |
C039 | only | IndE | male | 36–50 | 37 | 10 | 34771 | 288 |
C040 | only | IndE | male | >50 | 37 | 7 | 31927 | 219 |
C041 | only | PhiE | female | 14–25 | 170 | 12 | 93197 | 129 |
C042 | only | PhiE | female | 26–35 | 78 | 1 | 25083 | 40 |
C043 | only | PhiE | female | 36–50 | 54 | 1 | 11674 | 86 |
C044 | only | PhiE | female | >50 | 0 | 0 | 3613 | 0 |
C045 | only | PhiE | male | 14–25 | 69 | 0 | 32386 | 0 |
C046 | only | PhiE | male | 26–35 | 78 | 1 | 17283 | 58 |
C047 | only | PhiE | male | 36–50 | 106 | 0 | 6190 | 0 |
C048 | only | PhiE | male | >50 | 1 | 1 | 3937 | 254 |
summary(m.final.also <- stepAIC(
lm(TMW ~ 1, data=dd$also),
# bidirectional model selection ...
direction="both",
# between these two extremes: lower null & upper full model:
scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
trace=0)) # do not show all steps
Call:
lm(formula = TMW ~ VARIETY + GENDER + VARIETY:GENDER, data = dd$also)
Residuals:
Min 1Q Median 3Q Max
-388.46 -98.86 -65.25 101.85 608.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1508.8 117.4 12.856 1.65e-10 ***
VARIETYHKE -1404.4 166.0 -8.462 1.09e-07 ***
VARIETYPhiE -1052.9 166.0 -6.344 5.61e-06 ***
GENDERmale -526.8 166.0 -3.174 0.00526 **
VARIETYHKE:GENDERmale 497.7 234.7 2.120 0.04813 *
VARIETYPhiE:GENDERmale 431.9 234.7 1.840 0.08232 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 234.7 on 18 degrees of freedom
Multiple R-squared: 0.8635, Adjusted R-squared: 0.8256
F-statistic: 22.78 on 5 and 18 DF, p-value: 3.33e-07
summary(m.final.only <- stepAIC(
lm(TMW ~ 1, data=dd$only), direction="both",
scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
trace=0))
Call:
lm(formula = TMW ~ VARIETY + AGE + VARIETY:AGE, data = dd$only)
Residuals:
Min 1Q Median 3Q Max
-192.41 -74.38 0.00 74.38 192.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1069.3 106.8 10.011 3.54e-07 ***
VARIETYHKE -904.7 151.1 -5.989 6.32e-05 ***
VARIETYPhiE -1004.9 151.1 -6.653 2.35e-05 ***
AGE26–35 -450.7 151.1 -2.984 0.011401 *
AGE36–50 -589.3 151.1 -3.901 0.002105 **
AGE>50 -959.7 151.1 -6.353 3.65e-05 ***
VARIETYHKE:AGE26–35 474.6 213.6 2.222 0.046281 *
VARIETYPhiE:AGE26–35 435.2 213.6 2.037 0.064288 .
VARIETYHKE:AGE36–50 424.7 213.6 1.988 0.070118 .
VARIETYPhiE:AGE36–50 567.8 213.6 2.658 0.020877 *
VARIETYHKE:AGE>50 795.1 213.6 3.722 0.002917 **
VARIETYPhiE:AGE>50 1022.3 213.6 4.786 0.000444 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 151.1 on 12 degrees of freedom
Multiple R-squared: 0.8935, Adjusted R-squared: 0.7958
F-statistic: 9.149 on 11 and 12 DF, p-value: 0.0003006
2 Problems of the hypotheses/operationalizations
2.1 Lack of precision
2.2 What do between-variety frequencies reveal?
3 Problems of the statistical modeling
3.1 A few minor mistakes
3.1.1 Errors in the data
CASE PARTICLE VARIETY GENDER AGE SPKRS TOKENS WORDS TMW
22 C022 also PhiE male 26–35 75 5 17283 289.30163
46 C046 only PhiE male 26–35 78 1 17283 57.86033
CASE PARTICLE VARIETY GENDER AGE SPKRS TOKENS WORDS TMW
20 C020 also PhiE female >50 0 2 3613 553.5566
3.1.2 Under-reporting and -analyzing
3.1.3 Over-reporting
3.2 Four fundamental issues
3.2.1 Issue 1: no level-1 predictors
3.2.1.1 General introduction
3.2.1.2 Application to P&F
3.2.2 Issue 2: two models on one data set
3.2.2.1 General introduction
3.2.2.2 Application to P&F
summary(m.final.all <- stepAIC(
# note the new data argument: all the data from both particles now
lm(TMW ~ 1, data=d),
# bidirectional model selection ...
direction="both",
# between these 2 extremes: lower null & the new upper/full model:
scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY*PARTICLE),
trace=0)) # do not show all steps
3.2.3 Issue 3: not weighting observations
3.2.3.1 General introduction
3.2.3.2 Application to P&F
CASE VARIETY GENDER AGE SPKRS TOKENS WORDS TMW
1 C001 HKE female 14–25 84 45 107796 417.4552
3 C003 HKE female 36–50 6 0 5888 0.0000
5 C005 HKE male 14–25 15 5 16599 301.2230
3.2.4 Issue 4: potentially wrong regression model
4 Re-analyzing their data
4.1 Methods
# this could be fitted on all level-1 uses of also/only, clause-final or not:
m.required <- glmer(
CLAUSEFINAL ~ # binary response: no vs. yes
1 + # intercept
PARTICLE*AGE*SEX*VARIETY + # predictors & interactions
# minimally necessary random-effects structure (more is likely needed)
(1|SPEAKER), # speakers
family=binomial, # the response is binary
data=unaggregated.dataframe) # a new unaggregated data of all also/only
# Here are the contrast-coding changes:
# expanding the data
d.expanded <- as.data.frame( # make d.expanded a data frame of
lapply(d[,-6], # applying to d (w/out column 6)
# an anonym. function that repeats each value in each column SPKRS times:
\(af) rep(af, d[,6])))
# making AGE 'more ordinal'
contrasts(d.expanded$AGE) <- contr.sdif
# planned contrasts for VARIETY
ind_vs_other <- c(2/3, -1/3, -1/3)
hke_vs_phi <- c(0 , 1/2, -1/2)
contrasts(d.expanded$VARIETY) <- cbind(ind_vs_other, hke_vs_phi)
Call:
lm(formula = TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2,
data = d.expanded)
Residuals:
Min 1Q Median 3Q Max
-167.840 -26.955 4.813 23.663 240.163
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 622.757 11.746 53.019 < 2e-16 ***
PARTICLEonly -366.174 16.611 -22.044 < 2e-16 ***
AGE26–35-14–25 -329.911 13.127 -25.132 < 2e-16 ***
AGE36–50-26–35 115.089 15.701 7.330 3.47e-13 ***
AGE>50-36–50 -125.960 44.433 -2.835 0.00464 **
GENDERmale -140.878 9.435 -14.932 < 2e-16 ***
VARIETYind_vs_other 1311.489 16.620 78.909 < 2e-16 ***
VARIETYhke_vs_phi -259.573 29.130 -8.911 < 2e-16 ***
AGE26–35-14–25:GENDERmale 247.732 13.166 18.815 < 2e-16 ***
AGE36–50-26–35:GENDERmale 207.769 13.876 14.973 < 2e-16 ***
AGE>50-36–50:GENDERmale -217.169 29.864 -7.272 5.28e-13 ***
AGE26–35-14–25:VARIETYind_vs_other -163.289 21.384 -7.636 3.63e-14 ***
AGE36–50-26–35:VARIETYind_vs_other -43.210 24.007 -1.800 0.07205 .
AGE>50-36–50:VARIETYind_vs_other 583.758 54.798 10.653 < 2e-16 ***
AGE26–35-14–25:VARIETYhke_vs_phi -359.560 34.136 -10.533 < 2e-16 ***
AGE36–50-26–35:VARIETYhke_vs_phi -425.201 40.348 -10.538 < 2e-16 ***
AGE>50-36–50:VARIETYhke_vs_phi 739.750 105.090 7.039 2.75e-12 ***
GENDERmale:VARIETYind_vs_other -590.630 14.789 -39.936 < 2e-16 ***
GENDERmale:VARIETYhke_vs_phi 7.251 19.437 0.373 0.70917
PARTICLEonly:AGE26–35-14–25 193.176 18.563 10.406 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35 -177.554 22.203 -7.997 2.27e-15 ***
PARTICLEonly:AGE>50-36–50 -175.959 62.837 -2.800 0.00516 **
PARTICLEonly:GENDERmale 95.218 13.340 7.138 1.38e-12 ***
PARTICLEonly:VARIETYind_vs_other -861.414 23.504 -36.649 < 2e-16 ***
PARTICLEonly:VARIETYhke_vs_phi 319.783 41.195 7.763 1.39e-14 ***
PARTICLEonly:AGE26–35-14–25:GENDERmale -216.984 18.583 -11.676 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:GENDERmale -345.374 19.591 -17.630 < 2e-16 ***
PARTICLEonly:AGE>50-36–50:GENDERmale 616.900 42.232 14.607 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYind_vs_other -206.081 30.240 -6.815 1.29e-11 ***
PARTICLEonly:AGE36–50-26–35:VARIETYind_vs_other 1.377 33.948 0.041 0.96766
PARTICLEonly:AGE>50-36–50:VARIETYind_vs_other -962.302 77.496 -12.417 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYhke_vs_phi 427.858 48.272 8.864 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:VARIETYhke_vs_phi 179.836 57.056 3.152 0.00165 **
PARTICLEonly:AGE>50-36–50:VARIETYhke_vs_phi -847.091 148.619 -5.700 1.40e-08 ***
PARTICLEonly:GENDERmale:VARIETYind_vs_other 629.971 20.911 30.126 < 2e-16 ***
PARTICLEonly:GENDERmale:VARIETYhke_vs_phi -135.348 27.487 -4.924 9.26e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 71.97 on 1781 degrees of freedom
Multiple R-squared: 0.9731, Adjusted R-squared: 0.9725
F-statistic: 1838 on 35 and 1781 DF, p-value: < 2.2e-16
Single term deletions
Model:
TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 9224376 15575
PARTICLE:AGE:GENDER 3 5115524 14339900 16371 329.227 < 2.2e-16 ***
PARTICLE:AGE:VARIETY 6 2963271 12187648 16070 95.356 < 2.2e-16 ***
PARTICLE:GENDER:VARIETY 2 5551651 14776027 16428 535.944 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.2 Selected results
5 Concluding remarks
6 References
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] MASS_7.3-60.2 lme4_1.1-35.3 Matrix_1.7-0 effects_4.2-2 car_3.1-2
[6] carData_3.0-5 STGmisc_1.0 Rcpp_1.0.12 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] nlme_3.1-164 cli_3.6.2 knitr_1.46 rlang_1.1.3
[5] xfun_0.44 estimability_1.5.1 DBI_1.2.2 mitools_2.4
[9] survey_4.4-2 minqa_1.2.7 jsonlite_1.8.8 colorspace_2.1-0
[13] nnet_7.3-19 htmltools_0.5.8.1 rmarkdown_2.27 grid_4.4.0
[17] evaluate_0.23 abind_1.4-5 fastmap_1.2.0 yaml_2.3.8
[21] insight_0.19.11 htmlwidgets_1.6.4 rstudioapi_0.16.0 lattice_0.22-6
[25] digest_0.6.35 nloptr_2.0.3 splines_4.4.0 tools_4.4.0
[29] boot_1.3-30 survival_3.6-4