Ling 104, session 06: disp & means (g-o-f) (key)

Author
Affiliation

UC Santa Barbara & JLU Giessen

Published

05 Jan 2025 12-34-56

Load the file _input/partplacement.csv into a data frame d. This file contains data from a corpus study on the alternation of particle placement that was introduced in Section 1.3; you can find information about this data set in _input/partplacement.r.

rm(list=ls(all=TRUE)); library(magrittr)
summary(d <- read.delim(
   "_input/partplacement.csv",
   stringsAsFactors=TRUE))
      CASE          CONSTRUCTION     MEDIUM       DO_COMPLX     DO_LENSYLL
 Min.   :  1.00   v_do_prt:100   spoken :100   clausmod:  6   Min.   : 1.00
 1st Qu.: 50.75   v_prt_do:100   written:100   phrasmod: 67   1st Qu.: 2.00
 Median :100.50                                simple  :127   Median : 3.00
 Mean   :100.50                                               Mean   : 4.72
 3rd Qu.:150.25                                               3rd Qu.: 6.00
 Max.   :200.00                                               Max.   :31.00
      DO_ANIM        DO_CONC      PP
 animate  : 27   abstract: 95   no :167
 inanimate:173   concrete:105   yes: 33



                                         

1 Exercise 01

Question: Last time, we saw that the distributions of the lengths of the DOs of the verb-particle constructions differ across the two constructions (using the Kolmogorov-Smirnov test). You now want to test whether you can pinpoint what exactly is responsible for this result – their central tendency? their dispersion? something else? – so we try to answer the question whether the lengths of the direct objects of the two verb-particle constructions differ in their dispersions.

1.1 Hypotheses

The

  • dependent/response variable is DO_LENSYLL;
  • independent/predictor variable is CONSTRUCTION.

What are the hypotheses?

  • text hypotheses:
    • H1: The variances of the DOs’ lengths differ between the two verb-particle constructions;
    • H0: The variances of the DOs’ lengths don’t differ between the two verb-particle constructions;
  • statistical hypotheses:
    • H1: F≠1;
    • H0: F=1.

1.2 Descriptive stats/visualization

We first describe and visualize the data:

(qwe <- tapply(    # apply to
   d$DO_LENSYLL,   # these values
   d$CONSTRUCTION, # a grouping by these values
   var))           # then apply var to each group
 v_do_prt  v_prt_do
 2.917576 24.120808 
(observed_F <- "/"(  # divide
   qwe["v_do_prt"],  # the smaller variance by
   qwe["v_prt_do"])) # the larger variance
 v_do_prt
0.1209568 
stripchart( # stripchart: lengths against the construction
   jitter(d$DO_LENSYLL) ~ d$CONSTRUCTION,
   method="jitter", pch=16, col="#0000FF20"); grid()
# add the means to the plot (bec they are used to compute variances)
points(pch=4, cex=2,
   x=tapply(d$DO_LENSYLL, d$CONSTRUCTION, mean),
   y=1:2)

It certainly doesn’t seem like F could really be 1: the lengths for v_prt_do are spread out much more widely than those for v_do_prt.

1.3 Statistical testing

For this question, we would prefer to use an F-test for variance homogeneity, but that test requires the values in both groups to be normally distributed, which we already know isn’t the case:

tapply(            # apply to 
   d$DO_LENSYLL,   # these values
   d$CONSTRUCTION, # a grouping by these values
   # and run lillie.test on each group but as an anonymous function
   # so we can have R return only the p-value & nothing else
   FUN=\(af) nortest::lillie.test(af)$p.value)
    v_do_prt     v_prt_do
4.519140e-15 1.897841e-07 

Thus, we cannot do the F-test – we have to fall back on the Fligner-Killeen test instead, which we can either employ with the formula notation …

fligner.test(DO_LENSYLL~CONSTRUCTION, data=d)

    Fligner-Killeen test of homogeneity of variances

data:  DO_LENSYLL by CONSTRUCTION
Fligner-Killeen:med chi-squared = 44.174, df = 1, p-value = 3.004e-11

… or the vector notation:

fligner.test(d$DO_LENSYLL, d$CONSTRUCTION)

    Fligner-Killeen test of homogeneity of variances

data:  d$DO_LENSYLL and d$CONSTRUCTION
Fligner-Killeen:med chi-squared = 44.174, df = 1, p-value = 3.004e-11
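
For comparison only: had the Lilliefors tests not rejected normality, the F-test itself would be available as var.test() in base R, with the same formula notation. A minimal sketch on simulated data (the data frame toy and its columns LENGTH and GROUP are made up for illustration; they are not the particle-placement data):

```r
# simulated, hypothetical data: two normally distributed groups
set.seed(1)
toy <- data.frame(
   LENGTH=c(rnorm(100, mean=5, sd=1), rnorm(100, mean=5, sd=3)),
   GROUP=factor(rep(c("a", "b"), each=100)))
toy_vars <- tapply(toy$LENGTH, toy$GROUP, var) # variance per group
toy_F <- toy_vars[["a"]] / toy_vars[["b"]]     # F as the ratio of the variances
toy_test <- var.test(LENGTH ~ GROUP, data=toy) # F-test for variance homogeneity
toy_test$statistic # the same ratio as toy_F
toy_test$p.value   # two-tailed p-value
```

The statistic var.test() reports is the variance of the first factor level divided by that of the second, i.e. the same kind of ratio as observed_F above.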

1.4 Write-up

[Show stripchart.] The variances of the DOs’ lengths in the two constructions are very different: for v_do_prt the variance is 2.9, for v_prt_do it is 24.1, which means F is ≈0.12 (or, with the larger variance in the numerator, ≈8.27). Given the non-normality of both distributions of DO lengths (as per Lilliefors tests), an F-test for variance homogeneity was not possible, which is why a Fligner-Killeen test was conducted. According to that test, the variances of the DOs’ lengths are significantly different from each other (χ²=44.174, df=1, p<10⁻¹⁰).

1.5 Excursus

How might one test this using a simulation approach? One way would be to generate null hypothesis data by (i) destroying the correlation between the columns/variables DO_LENSYLL and CONSTRUCTION by randomly reordering one of the columns/variables, (ii) computing the variances and F-value we get for these data, and (iii) doing this thousands of times; the result is a distribution of F-values for when the null hypothesis is true:

set.seed(123) # set the seed of the random number generator (for replicability)
sampled_Fs <- numeric(10000) # generate a collector vector
for (i in 1:10000) { # do the following 10000 times
   randomized_constr <- sample(d$CONSTRUCTION)
   sampled_vars <- tapply( # apply to 
      d$DO_LENSYLL,        # these values
      randomized_constr,   # a grouping by these values
      var)                 # then apply var to each group
   sampled_Fs[i] <- "/"(        # divide
      sampled_vars["v_do_prt"], # the variance for this construction by
      sampled_vars["v_prt_do"]) # the variance for the other
}
quantile(sampled_Fs, probs=c(0, 0.025, 0.5, 0.975, 1))
       0%      2.5%       50%     97.5%      100%
0.2765383 0.4581894 1.0058151 2.1918747 3.5066277 
hist(sampled_Fs, xlim=c(0.1, 10)) # plot a histogram of the sampled F-values w/ this x-axis
   abline(v=c(observed_F, 1/observed_F), lty=2) # add vertical lines at the observed F-values

In the verb-particle construction data, the F-ratio resulting from the variances of the DOs’ lengths in both constructions was 0.121 (8.267). In a simulation where the lengths were randomly assigned to the two constructions (10,000 iterations), not a single value obtained deviated that much from the null hypothesis expectation of 1: the minimal value ever obtained was 0.277; the maximal value ever obtained was 3.507. Thus, the observed ratio of the variances is highly significantly different from 1 (p<0.0001).
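
The p-value reported in this paragraph can also be computed rather than read off the extremes. The following is a minimal sketch, not the code used for this session: the helper perm_p_F is a made-up name, and adding 1 to numerator and denominator (so that p can never be exactly 0) is one common convention that I am assuming here.

```r
# hypothetical helper: two-tailed permutation p-value for a variance ratio
perm_p_F <- function(sampled_Fs, observed_F) {
   # fold the ratios so that values below 1 & above 1 become comparable
   fold <- function(f) pmax(f, 1/f)
   # share of permuted ratios at least as extreme as the observed one
   # (+1 in numerator & denominator so that p is never exactly 0)
   (sum(fold(sampled_Fs) >= fold(observed_F)) + 1) / (length(sampled_Fs) + 1)
}
# toy illustration w/ five made-up permuted ratios
(toy_p <- perm_p_F(c(0.5, 0.9, 1.1, 2.5, 9), 0.12))
```

Applied to the sampled_Fs and observed_F from above, where no permuted ratio is as extreme as the observed one, this convention yields 1/10001, which is in line with the reported p<0.0001.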

Another way to use a simulation approach would be to use bootstrapping (here, simple percentile bootstrapping) to compute a confidence interval for the F-value we computed. We could do so by (i) sampling as many data points as we have from the data points that we have but with replacement, (ii) computing the variances and F-value we get for these data, and (iii) doing this thousands of times; the result is a distribution of F-values for samples that are similar to ours, samples we might have gotten if our data had been slightly different.
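
The three steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that produced the numbers reported for this session: the helper name boot_F_ci and the simulated stand-in data are made up, and rows are resampled as wholes so that each length keeps its construction label.

```r
# hypothetical helper: percentile-bootstrap CI for a ratio of two group variances
boot_F_ci <- function(values, groups, n_iter=10000, conf=0.95) {
   groups <- factor(groups)
   boot_Fs <- numeric(n_iter) # collector vector
   for (i in seq_len(n_iter)) {
      # (i) resample the row positions with replacement
      rows <- sample(length(values), replace=TRUE)
      # (ii) variances per group & their ratio for this resample
      boot_vars <- tapply(values[rows], groups[rows], var)
      boot_Fs[i] <- boot_vars[[1]] / boot_vars[[2]]
   }
   # (iii) the percentile CI is the middle share of the resampled ratios
   tail_prob <- (1 - conf) / 2
   quantile(boot_Fs, probs=c(tail_prob, 1 - tail_prob), na.rm=TRUE)
}
# simulated stand-in data (not the particle-placement data)
set.seed(123)
toy_values <- c(rnorm(100, sd=1), rnorm(100, sd=3))
toy_groups <- rep(c("a", "b"), each=100)
(toy_ci <- boot_F_ci(toy_values, toy_groups, n_iter=2000))
```

With the data of this session, the call would be boot_F_ci(d$DO_LENSYLL, d$CONSTRUCTION); the exact interval depends on the seed and on the number of iterations.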

In the verb-particle construction data, the F-ratio resulting from the variances of the DOs’ lengths in both constructions was 0.12. The 95% confidence interval obtained by percentile bootstrapping (10,000 iterations) for this F-value is [0.063, 0.231] and the maximal F-value of all iterations was 0.361, meaning there are no values coming even close to the F-value of 1 that would be expected from the null hypothesis. [Better bootstrapping methods that I am not discussing here lead to results that, for all practical intents and purposes, are the same; the CI there is [0.0584, 0.2210].]

2 Exercise 02

Question: You come across a study of verb-particle constructions that doesn’t use native speaker data like you do, but data from advanced learners of English (with a Finnish L1 background). In that study, the variance of the learners’ DO lengths in all constructions is 14.32 – is that amount of variability significantly different from the one in your study? In other words, are the advanced learners using as wide a range of (lengths of) DOs as your native speakers?

2.1 Hypotheses

The

  • dependent/response variable is DO_LENSYLL;
  • independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_LENSYLL.

What are the hypotheses?

  • text hypotheses:
    • H1: The variance of the non-native speakers’ DO lengths is different from the variance of your native speakers’ DO lengths;
    • H0: The variance of the non-native speakers’ DO lengths is not different from the variance of your native speakers’ DO lengths;
  • statistical hypotheses:
    • H1: χ²>0;
    • H0: χ²=0.

2.2 Descriptive stats/visualization

We first describe the data:

var(d$DO_LENSYLL)
[1] 18.58452
stripchart(
   d$DO_LENSYLL, method="jitter",
   xlab="DO lengths in syllables",
   pch=16, col="#00FF0020"); grid()
# add the mean to the plot (bec it is used to compute the variance)
points(pch=4, cex=2,
   x=mean(d$DO_LENSYLL),
   y=1)

2.3 Statistical testing

For this question, we would prefer to use a chi-squared test for goodness-of-fit, which requires normality of the values involved, which we already know isn’t the case:

nortest::lillie.test(d$DO_LENSYLL)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  d$DO_LENSYLL
D = 0.19409, p-value < 2.2e-16
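
For reference, here is a sketch of what such a test would compute if normality held, using one common textbook form: the statistic (n-1)·s²/σ₀² is compared against a chi-squared distribution with n-1 degrees of freedom. The helper name chisq_var_test, the two-tailed convention (doubling the smaller tail), and the simulated data are assumptions for illustration, not the book's code.

```r
# hypothetical helper: chi-squared test of one variance against a reference value
chisq_var_test <- function(x, sigma2_0) {
   df <- length(x) - 1
   chisq <- df * var(x) / sigma2_0 # test statistic
   lower <- pchisq(chisq, df)      # P(statistic <= observed value) under H0
   # two-tailed p-value: twice the smaller of the two tails
   c(chisq=chisq, df=df, p=2 * min(lower, 1 - lower))
}
set.seed(1)
toy <- rnorm(200, mean=5, sd=4) # simulated normal data, true variance 16
(toy_result <- chisq_var_test(toy, sigma2_0=16))
```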

Now what … The book doesn’t mention a test that can be used when the variable in question is non-normal! (And I actually don’t know one.) That means, now we really have to do a simulation study. Here’s one approach one might try: We draw 10,000 samples with replacement from DO_LENSYLL and compute the variance for each. Then, we proceed as with confidence intervals and check whether the Finnish learner study’s variance is within the 95% confidence interval of our resampled data.

set.seed(123) # set the seed of the random number generator (for replicability)
sampled_vars <- rep(NA, 10000) # generate a collector vector (way 2)
for (i in 1:10000) { # do something 10K times, namely
   # draw a random sample (w/ replacement) from the DO lengths:
   sampled_DOLEN <- sample(d$DO_LENSYLL, replace=TRUE)
   # store the variance of that sample
   sampled_vars[i] <- var(sampled_DOLEN)
}
(qwe <- quantile(sampled_vars, probs=c(0.025, 0.5, 0.975)))
    2.5%      50%    97.5%
12.06967 18.07857 27.35376 

And then we plot and ‘test’:

hist(sampled_vars, breaks=30, col="#FF000020",
     main="", xlab="Sampled variances")
abline(v=qwe, col="red")
abline(v=14.32, lty=3, lwd=3, col="blue")

In fact, it is easy to determine how many of the sampled variances are smaller than or equal to the 14.32 from the other study:

mean(sampled_vars<=14.32) # or the more advanced way:
[1] 0.1375
# quantiles_sampled_vars <- ecdf(sampled_vars)
# quantiles_sampled_vars(14.32)

2.4 Write-up

The variance of the DOs’ lengths in the current verb-particle construction data is ≈18.58. To compare this variance to that of the Finnish learner corpus data of 14.32, a chi-squared test for goodness-of-fit was considered, but was not permitted because, according to a Lilliefors test, the DO lengths in the current data are not normally distributed (p<10⁻¹⁵). Instead, a confidence interval was computed from the variances of 10,000 bootstrapped versions of the current DO lengths and the corresponding 2.5% and 97.5% quantiles (12.07 and 27.35 respectively). Since the variance of the Finnish learners falls within that 95% confidence interval – the variance of the learners is greater than or equal to 13.75% of the sampled variances – it is concluded to not be significantly different from the present data set: The learners are using as wide a range of (lengths of) DOs as the native speakers. [As above, better bootstrapping methods that I am not discussing here lead to the same conclusion that 14.32 is within the relevant 95% confidence interval of [13.4, 32.2].]

3 Exercise 03

Question: You come across a study of verb-particle constructions that doesn’t use native speaker data like you do, but data from advanced learners of English (with a Finnish L1 background). In that study, the mean of the learners’ DO lengths in all constructions is 5.8 (with a median of 6) – is that average length different from the one in your study? In other words, are the advanced learners using the same average length of DOs as your native speakers?

3.1 Hypotheses

The

  • dependent/response variable is DO_LENSYLL;
  • independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_LENSYLL.

What are the hypotheses?

  • text hypotheses:
    • H1: The mean length of the non-native speakers’ DOs is different from the mean length of your native speakers’ DOs;
    • H0: The mean length of the non-native speakers’ DOs is not different from the mean length of your native speakers’ DOs;
  • statistical hypotheses:
    • H1: t≠0;
    • H0: t=0.

3.2 Descriptive stats/visualization

We first describe the data, which by now we have done super often:

summary(d$DO_LENSYLL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    2.00    3.00    4.72    6.00   31.00 
stripchart(
   d$DO_LENSYLL, method="jitter",
   xlab="DO lengths in syllables",
   pch=16, col="#FF008F20"); grid()
# add the mean to the plot (bec this is about the mean)
points(pch=4, cex=2,
   x=mean(d$DO_LENSYLL),
   y=1)

3.3 Statistical testing

For this question, we would prefer to use a one-sample t-test, which requires normality of the values involved, which we already know isn’t the case:

nortest::lillie.test(d$DO_LENSYLL)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  d$DO_LENSYLL
D = 0.19409, p-value < 2.2e-16

Thus, we adjust our hypotheses to the ordinal level to which we fall back now:

  • text hypotheses:
    • H1: The median length of the non-native speakers’ DOs is different from the median length of your native speakers’ DOs;
    • H0: The median length of the non-native speakers’ DOs is not different from the median length of your native speakers’ DOs;
  • statistical hypotheses:
    • H1: the difference between your median and the one from the other study is not 0;
    • H0: the difference between your median and the one from the other study is 0.

First, we do the (less powerful) sign test. How many of our DO lengths are greater than the Finnish median?

sum(d$DO_LENSYLL>6)
[1] 44

44 of our 200 lengths are greater than the median in the Finnish data, but according to H0 it should be 100 (half of the data, because that’s what the median is). What is the probability of getting at most 44 lengths greater than the expected median?

(lower_tail <- sum(  # make lower_tail the sum of
   dbinom(0:44,  # the probabilities to get 0, 1, ..., 43, 44 values (that are >6)
          200,   # in 200 trials (the number of VPCs you have)
          0.5))) # when the chance to get a greater result is 50% (H0)
[1] 3.418956e-16

But this is a two-tailed test so we have to do the same for ‘the other side’, the other 45 most extreme values:

(upper_tail <- sum(  # make upper_tail the sum of
   dbinom(156:200, # the probabilities to get 156, 157, ..., 199, 200 values (that are >6)
          200,     # in 200 trials (the number of VPCs you have)
          0.5)))   # when the chance to get a greater result is 50% (H0)
[1] 3.418956e-16

The total p value is then the sum of the two:

lower_tail + upper_tail
[1] 6.837911e-16

Which you can verify like this:

binom.test( # compute the probability of a binomial test for
   x=44,    # this number of 'successes'
   n=200,   # this number of trials, here the 200 constructions
   conf.level=0.95)$p.value # the confidence level we want; 0.95 = default
[1] 6.837911e-16
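
The two tails can also be obtained without summing individual probabilities, because pbinom returns cumulative binomial probabilities directly. A minimal sketch (the object names n_greater, n_trials, and p_two_tailed are mine, chosen for illustration); the result should agree with the sum of the two tails computed above:

```r
# sign test by hand: 44 of 200 values are above the hypothesized median
n_greater <- 44; n_trials <- 200
# cumulative probability of 0, 1, ..., 44 'successes' under H0 (p=0.5)
lower_tail <- pbinom(n_greater, n_trials, 0.5)
# probability of 156, 157, ..., 200 'successes', i.e. of more than 155
upper_tail <- pbinom(n_trials - n_greater - 1, n_trials, 0.5, lower.tail=FALSE)
(p_two_tailed <- lower_tail + upper_tail)
```

Because the binomial distribution is symmetric when the probability of success is 0.5, the two tails are identical here, so doubling lower_tail would give the same p-value.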

Then, we compute the (better) Wilcoxon signed-rank test:

wilcox.test(      # compute a Wilcoxon signed-rank test of
   d$DO_LENSYLL,  # the DO lengths
   mu=6,          # against this expected median
   correct=FALSE) # don't use a continuity correction

    Wilcoxon signed rank test

data:  d$DO_LENSYLL
V = 4698, p-value = 4.726e-08
alternative hypothesis: true location is not equal to 6

3.4 Write-up

In the present verb-particle construction data, the mean/median length of the DOs is 4.72/3 respectively. To compare this to the mean/median of 5.8/6 in the Finnish learner corpus study, a one-sample t-test was considered, but was not permitted because, according to a Lilliefors test, the DO lengths in the current data are not normally distributed (p<10⁻¹⁵). Instead, a one-sample sign test and a Wilcoxon signed-rank test were computed, both of which concluded that the average of the Finnish learner data is significantly different from the current native speaker data: The two-tailed p-value of the one-sample sign test is 6.84×10⁻¹⁶, the Wilcoxon signed-rank test returned V=4698, with p=4.726×10⁻⁸.

3.5 Excursus

Here’s one approach one might try: We draw 10,000 samples with replacement from DO_LENSYLL and compute the mean for each. Then, we proceed as with confidence intervals and check whether the Finnish learner study’s mean is within the 95% confidence interval of our resampled data.

set.seed(123) # set the seed of the random number generator (for replicability)
sampled_means <- rep(NA, 10000) # generate a collector vector
for (i in 1:10000) { # do something 10K times, namely
   # draw a random sample (w/ replacement) from the DO lengths:
   sampled_DOLEN <- sample(d$DO_LENSYLL, replace=TRUE)
   # store the mean of that sample
   sampled_means[i] <- mean(sampled_DOLEN)
}
(qwe <- quantile(sampled_means, probs=c(0.025, 0.5, 0.975)))
 2.5%   50% 97.5%
4.155 4.715 5.340 

And then we plot and ‘test’:

hist(sampled_means, breaks=30, col="#FF000020",
     main="", xlab="Sampled means")
abline(v=qwe, col="red")
abline(v=5.8, lty=3, lwd=3, col="blue")

The mean of the DOs’ lengths in the current verb-particle construction data is ≈4.72. To compare this mean to that of the Finnish learner corpus data of 5.8, a one-sample t-test was considered, but was not permitted because, according to a Lilliefors test, the DO lengths in the current data are not normally distributed (p<10⁻¹⁵). Instead, a confidence interval was computed from the means of 10,000 bootstrapped versions of the current DO lengths and the corresponding 2.5% and 97.5% quantiles (4.16 and 5.34 respectively). Since the mean of the Finnish learners does not fall within that 95% confidence interval, it is concluded to be significantly different from the present data set: The learners are using DOs that are on average significantly longer than those of the native speakers.

4 Homework

To prepare for next session, read (and work through!) SFLWR3: Sections 4.3.2-4.4.

5 Session info

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] STGmisc_1.0    Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.50         nortest_1.0-4
 [5] knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.3
 [9] rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.3    yaml_2.3.10
[13] rlang_1.1.5       jsonlite_1.8.9    htmlwidgets_1.6.4 MASS_7.3-64