Ling 104, session 07: means & corrs (key)
1 Exercise 01
Question: We have known for a long time that the frequency of a word is one of several powerful predictors of how quickly people can recognize the word as a word on a screen: more frequent words are recognized faster. We now want to determine whether that’s also true for recently collected data on word recognition/reaction times collected for native speakers of English and Spanish when they were shown both English and Spanish words. Load the file _input/rts.csv into a data frame d. This file contains data from an experimental study on reaction times to words; you can find information about this data set in _input/rts.r. We want to see whether there is a correlation between ZIPFFREQ and RT; for that correlation, check the assumptions of the correlation you should want to compute and collect those results, but – for didactic reasons only – don’t act on the assumptions/results even if they turn out to be not met and pretend they’re fine. Once you’re done, consider briefly the following: ideally, we would want to see whether that frequency effect is observed even if we somehow control for length. (This is an important exercise and data set in the sense that, if you take more stats classes with me, this data set will feature in all of them to develop successively more comprehensive analyses.)
1.1 Hypotheses
The
- dependent/response variable is
RT; - independent/predictor variable is
ZIPFFREQ.
What are the hypotheses?
- text hypotheses:
- H1: there is a negative correlation between the (Zipf) frequency of a word and its recognition/reaction time;
- H0: there is no negative correlation between the (Zipf) frequency of a word and its recognition/reaction time;
- statistical hypotheses:
- H1: r<0;
- H0: r≥0.
1.2 Descriptive stats/visualization
We load the data:
CASE RT LENGTH LANGUAGE GROUP
Min. : 1 Min. : 271.0 Min. :3.000 english:4023 english :3961
1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 spanish:3977 heritage:4039
Median :4310 Median : 595.0 Median :5.000
Mean :4303 Mean : 661.5 Mean :5.198
3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
Max. :8610 Max. :4130.0 Max. :9.000
NAs :248
CORRECT FREQPMW ZIPFFREQ SITE
correct :7749 Min. : 1.00 Min. :3.000 a:2403
incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
Median : 42.00 Median :4.623 c:2782
Mean : 81.14 Mean :4.591
3rd Qu.: 101.00 3rd Qu.:5.004
Max. :1152.00 Max. :6.061
We do descriptive stuff: We visualize the dependent variable (the response) and the independent variable (the predictor):
Are the two variables normally distributed? No, they are not, but I told you to ignore that for now.
Lilliefors (Kolmogorov-Smirnov) normality test
data: d$RT
D = 0.15117, p-value < 2.2e-16
Lilliefors (Kolmogorov-Smirnov) normality test
data: d$ZIPFFREQ
D = 0.038074, p-value < 2.2e-16
What are the descriptive results for the correlation between the two variables?
1.3 Statistical testing
We test the correlation with a Pearson product-moment correlation (with a one-tailed p-value, remember the hypotheses) and a linear regression model:
Pearson's product-moment correlation
data: d$ZIPFFREQ and d$RT
t = -11.544, df = 7750, p-value < 2.2e-16
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.00000 -0.11161
sample estimates:
cor
-0.1300222
Call:
lm(formula = RT ~ ZIPFFREQ, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-411.4 -153.1 -64.6 71.5 3486.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 949.36 25.11 37.80 <2e-16 ***
ZIPFFREQ -62.46 5.41 -11.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 260.9 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.01691, Adjusted R-squared: 0.01678
F-statistic: 133.3 on 1 and 7750 DF, p-value: < 2.2e-16
What is the exact p-value?
1.4 Write-up
According to a one-tailed Pearson product-moment correlation, there is a highly significant but apparently very weak negative correlation between RT and ZIPFFREQ (r=-0.13, mult. R2=0.017, t=-11.54, df=7750, pone-tailed<10-30). For every one-unit increase of ZIPFFREQ, words are recognized 62.5 ms faster. [show plot]
1.5 Excursus
Now, how we might we – in an amateurish way for now, we’ll do this better later – ‘control’ for the effect of length? We could compute the above correlation between ZIPFFREQ and RT for each length by looping over the data and computing as well as collecting the correlations and the regression line slopes for each length:
correlations <- slopes <- CIlowers <- CIuppers <- rep(NA, length(unique(d$LENGTH)))
for (i in seq(correlations)) {
d_curr <- d[d$LENGTH==i+2,]
m_curr <- lm(RT ~ ZIPFFREQ, data=d_curr, na.action=na.exclude)
correlations[i] <- summary(m_curr)$r.squared %>% sqrt %>% "*"(sign(coef(m_curr)[2]))
slopes[i] <- coef(m_curr)[2]
CIlowers[i] <- confint(m_curr)[2,1]
CIuppers[i] <- confint(m_curr)[2,2]
}Let’s see what the results are:
LENGTH NS CORRS SLOPES SLOPESlower SLOPESupper
1 3 198 -0.15758496 -69.60829 -133.22773 -5.988848
2 4 1910 -0.10650861 -41.74682 -59.63420 -23.859430
3 5 3066 -0.11900365 -58.07387 -75.43092 -40.716818
4 6 2025 -0.07532891 -40.92908 -64.80203 -17.056142
5 7 583 -0.09729075 -44.04661 -81.89019 -6.203032
6 8 176 -0.32784937 -244.93730 -353.09566 -136.778938
7 9 42 -0.24854012 -735.44933 -1663.79051 192.891851
It seems as if there is a correlation between ZIPFFREQ and RT even if LENGTH is controlled for (by holding it constant): every correlation and every slope between ZIPFFREQ and RT is negative as well and, with the exception of LENGTH being 9, all slopes of ZIPFFREQ have confidence intervals that do not include 0. While this is not how this should be done, it at least recognizes the general need for a multifactorial approach; later courses will tell you how to do this right.
2 Exercise 02
Question: In the same reaction time data, we want to see whether the two speaker groups – the variable GROUP, which distinguished between native speakers of English who were learners of Spanish (english) and heritage speakers of Spanish (heritage) – differed in their average reaction times to words. As above, check the assumptions of the test you should want to compute and collect those results, but – for didactic reasons only – don’t act on the assumptions/results even if they turn out to be not met and pretend they’re fine (This is an important exercise and data set in the sense that, if you take more stats classes with me, this data set will feature in all of them to develop successively more comprehensive analyses.)
2.1 Hypotheses
The
- dependent/response variable is
RT; - independent/predictor variable is
GROUP.
What are the hypotheses?
- text hypotheses:
- H1: there is a difference between the average reaction/recognition times for English learners of Spanish and Heritage speakers of Spanish;
- H0: there is no difference between the average reaction/recognition times for English learners of Spanish and Heritage speakers of Spanish;
- statistical hypotheses:
- H1: t≠0;
- H0: t=0.
2.2 Descriptive stats/visualization
We already loaded the data so/but we again now We visualize the dependent variable (the response) and the independent variable (the predictor):
Are the two sets of values of the response normally distributed? No, they are not, but I told you to ignore that for now.
$english
Lilliefors (Kolmogorov-Smirnov) normality test
data: X[[i]]
D = 0.15741, p-value < 2.2e-16
$heritage
Lilliefors (Kolmogorov-Smirnov) normality test
data: X[[i]]
D = 0.14387, p-value < 2.2e-16
What are the descriptive results for the correlation between the two variables?
$english
Min. 1st Qu. Median Mean 3rd Qu. Max. NAs sd
271.0000 499.0000 588.0000 663.9734 737.0000 4130.0000 168.0000 289.1716
$heritage
Min. 1st Qu. Median Mean 3rd Qu. Max. NAs sd
281.0000 514.5000 598.0000 659.0687 729.0000 2594.0000 80.0000 235.4064
It certainly does seem as if the two speaker groups were about equally fast.
2.3 Statistical testing
Because we ignore the violation of normality for didactic reasons, we test the difference correlation with a t-test for independent samples and also a linear regression model, but now not with a numeric independent variable/predictor (like always before and just above) but a binary independent variable/predictor. For the same didactic reasons as before, we also compute not the usual t-test version (Welch’s t-test), but a version where we force t.test to assume the standard deviations/variances in the two groups are identical; this one is often called Student’s t-test.
Two Sample t-test
data: d$RT by d$GROUP
t = 0.82051, df = 7750, p-value = 0.412
alternative hypothesis: true difference in means between group english and group heritage is not equal to 0
95 percent confidence interval:
-6.81303 16.62237
sample estimates:
mean in group english mean in group heritage
663.9734 659.0687
Call:
lm(formula = RT ~ GROUP, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-393.0 -156.1 -67.0 70.9 3466.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 663.973 4.272 155.431 <2e-16 ***
GROUPheritage -4.905 5.978 -0.821 0.412
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 263.1 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 8.686e-05, Adjusted R-squared: -4.216e-05
F-statistic: 0.6732 on 1 and 7750 DF, p-value: 0.412
2.4 Write-up
According to a two-tailed Student’s t-test for independent samples, there is no significant difference between the average reaction/recognition times between the English learners of Spanish (mean: 663.97 ms, sd=289.17 ms) and the Heritage speakers of Spanish (mean: 659.07 ms, sd=235.41 ms): t=0.82, df=7750, ptwo-tailed0.412). [show plot]
3 Exercise 03
Question: Earlier, we saw in the data from _input/partplacement.csv (see _input/partplacement.r) that the distributions of the lengths of the DOs of the verb-particle constructions differ across the two constructions (using the Kolmogorov-Smirnov test). Then, we followed that up with a test of whether the dispersions of the lengths of the DOs of the verb-particle constructions differ across the two constructions (using the Fligner-Killeen test). You now want to also test whether there is also a difference in the central tendency between the DO lengths in the two constructions. (That, too, could be explainable with processing considerations along the lines of short-before-long (since most particles are just one word or syllable long).)
3.1 Hypotheses
The
- dependent/response variable is
DO_LENSYLL; - independent/predictor variables is
CONSTRUCTION.
What are the hypotheses?
- text hypotheses:
- H1: The mean length of DOs in the construction v_do_prt is different from the mean length of DOS in the construction v_prt_do;
- H0: The mean length of DOs in the construction v_do_prt is not different from the mean length of DOS in the construction v_prt_do;
- statistical hypotheses:
- H1: t≠0;
- H0: t=0.
3.2 Descriptive stats/visualization
We load the data:
CASE CONSTRUCTION MEDIUM DO_COMPLX DO_LENSYLL
Min. : 1.00 v_do_prt:100 spoken :100 clausmod: 6 Min. : 1.00
1st Qu.: 50.75 v_prt_do:100 written:100 phrasmod: 67 1st Qu.: 2.00
Median :100.50 simple :127 Median : 3.00
Mean :100.50 Mean : 4.72
3rd Qu.:150.25 3rd Qu.: 6.00
Max. :200.00 Max. :31.00
DO_ANIM DO_CONC PP
animate : 27 abstract: 95 no :167
inanimate:173 concrete:105 yes: 33
We first describe the data; it seems the pattern could hardly be clearer …
$v_do_prt
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 2.00 2.46 3.00 9.00
$v_prt_do
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.00 6.00 6.98 9.25 31.00
3.3 Statistical testing
For this question, we would prefer to use a t-test for independent samples, which requires normality of the two groups to be compared, which we already know isn’t the case:
v_do_prt v_prt_do
4.519140e-15 1.897841e-07
Thus, we adjust our hypotheses to the ordinal level to which we fall back now:
- text hypotheses:
- H0: The median length of DOs in the construction v_do_prt is not different from the median length of DOS in the construction v_prt_do;
- H1: The median length of DOs in the construction v_do_prt is different from the median length of DOS in the construction v_prt_do;
- statistical hypotheses:
- H0: W=0;
- H1: W>0.
Then we compute the U-test:
Wilcoxon rank sum test
data: d$DO_LENSYLL by d$CONSTRUCTION
W = 1509.5, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
3.4 Write-up
[Show descriptive statistics.] To compare the average DO lengths in the two constructions, a t-test for independent samples was considered, but was not permitted because, according to a Lilliefors test, the DO lengths in the each construction are not normally distributed (both p<10-6). Instead, a U-test was computed, which concluded that the DO lengths of the two constructions are significantly different from each other (W=1509.5, p<10-15).
3.5 Excursus
What if our hypothesis had been one-tailed? Because of an expected short-before-long preference, it would have been reasonable and completely defensible to expect that the average length of DOs in v_do_prt is smaller than in v_prt_do. In that case, the test changes to the following:
Wilcoxon rank sum test
data: d$DO_LENSYLL by d$CONSTRUCTION
W = 1509.5, p-value < 2.2e-16
alternative hypothesis: true location shift is less than 0
The conclusion doesn’t change much: [Show descriptive statistics.] To see whether the average DO length in the construction v_do_prt is indeed shorter than in the construction v_prt_do, a one-tailed t-test for independent samples was considered, but was not permitted because, according to a Lilliefors test, the DO lengths in the each construction are not normally distributed (both p<10-6). Instead, a one-tailed U-test was computed, which concluded that the average DO length in the construction v_do_prt is indeed shorter than in v_prt_do (W=1509.5, pone-tailed<10-15).
4 Exercise 04
Question: Does a particular grammar exercise help improve the performance of second language learners. 10 learners do a grammar test and you note their numbers of correct answers. Then, the learners first do the grammar exercise you want to test and then participate in a second test (with the same number of questions). Again, you note the numbers of correct responses, hoping and expecting that the exercise leads to a better performance.
4.1 Hypotheses
The
- dependent/response variable is the number of correct responses;
- independent/predictor variable is the point of time of testing.
What are the hypotheses?
- text hypotheses:
- H1: The mean number of correct responses after the exercise is higher than the mean number of correct responses before the exercise;
- H0: The mean number of correct responses after the exercise is not higher than the mean number of correct responses before the exercise;
- statistical hypotheses (with t being based on before minus after):
- H1: t<0.
- H0: t&ge0;
4.2 Descriptive stats/visualization
The data are in _input/grmtest.csv, with their explanation in _input/grmtest.r; we load them from there:
STUDENT BEFORE AFTER
Min. : 1.00 Min. :1.0 Min. :4.00
1st Qu.: 3.25 1st Qu.:2.5 1st Qu.:5.25
Median : 5.50 Median :5.5 Median :6.00
Mean : 5.50 Mean :4.9 Mean :6.40
3rd Qu.: 7.75 3rd Qu.:7.0 3rd Qu.:7.75
Max. :10.00 Max. :8.0 Max. :9.00
This is not the usual case-by-variable format! We can actually do everything we want to do with this so-called wide format but, for consistency, we change this to what we are used to and what is nearly always better or even required:
d <- data.frame( # create a data frame x
CASE=1:20, # with case numbers
SUBJ=rep(1:10, 2), # with subject identifiers
CR=c(d$BEFORE, d$AFTER), # all numbers of correct responses in 1 column/variable
TT=rep(c("before", "after"), each=10), # all times of testing in 1 column/variable
stringsAsFactors=TRUE)
summary(d) CASE SUBJ CR TT
Min. : 1.00 Min. : 1.0 Min. :1.00 after :10
1st Qu.: 5.75 1st Qu.: 3.0 1st Qu.:4.00 before:10
Median :10.50 Median : 5.5 Median :6.00
Mean :10.50 Mean : 5.5 Mean :5.65
3rd Qu.:15.25 3rd Qu.: 8.0 3rd Qu.:7.00
Max. :20.00 Max. :10.0 Max. :9.00
Out of pure ‘anality’, we change the order of the levels of TT to one that is not alphabetical but chronological. (Actually, it’s not just anality, this kind of stuff can be really important later.)
CASE SUBJ CR TT
Min. : 1.00 Min. : 1.0 Min. :1.00 before:10
1st Qu.: 5.75 1st Qu.: 3.0 1st Qu.:4.00 after :10
Median :10.50 Median : 5.5 Median :6.00
Mean :10.50 Mean : 5.5 Mean :5.65
3rd Qu.:15.25 3rd Qu.: 8.0 3rd Qu.:7.00
Max. :20.00 Max. :10.0 Max. :9.00
We first describe the data, bearing in mind these are dependent samples already:
$before
Min. 1st Qu. Median Mean 3rd Qu. Max. sd
1.000000 2.500000 5.500000 4.900000 7.000000 8.000000 2.514403
$after
Min. 1st Qu. Median Mean 3rd Qu. Max. sd
4.000000 5.250000 6.000000 6.400000 7.750000 9.000000 1.837873
Min. 1st Qu. Median Mean 3rd Qu. Max.
-8.00 -3.75 -0.50 -1.50 1.00 3.00
Let’s visualize with the arrows plot:
plot(1, 10, type="n", axes=FALSE, # plot a point at {1,10} NOT, and no axes
xlim=1:2, ylim=c(0, 10), # w/ these axis limits &
xlab="Test", ylab="Correct responses") # these axis labels
axis(1, at=1:2, labels=c("Before", "After")) # add an x-axis with these tickmarks
axis(2); box(); grid() # add a y-axis, a box, a grid
arrows(1, # draw arrows from the starting points with x=1
d$CR[d$TT=="before"], # and y= the before scores
2, # to the end points with x=2
d$CR[d$TT=="after"], # and y= the after scores
col=ifelse(test_diffs>0, # if after > before
"green", # make the arrow green
"red")) # otherwise red
arrows(1, # draw one arrow from x=1 and
mean(d$CR[d$TT=="before"]), # y= the mean before score
2, # to x=2 and
mean(d$CR[d$TT=="after"]), # y= the mean after score
lwd=2) # with a heavy/bold lineThis looks like there might be an improvement (but also a lot of variability).
4.3 Statistical testing
For this question, we would prefer to use a t-test for dependent samples, which requires normality of the pairwise differences, which we check:
Lilliefors (Kolmogorov-Smirnov) normality test
data: test_diffs
D = 0.16034, p-value = 0.6601
We can proceed with the t-test:
Paired t-test
data: d$CR[d$TT == "before"] and d$CR[d$TT == "after"]
t = -1.3072, df = 9, p-value = 0.1118
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 0.6034255
sample estimates:
mean difference
-1.5
Yes, we could have gotten this with t.test(x=d$BEFORE, y=d$AFTER, paired=TRUE, alternative="less"), but now we’re getting it from the data being organized the right way … Also, see the excursus below for why this would have also worked: t.test(d$BEFORE-d$AFTER, mu=0, alternative="less").
4.4 Write-up
According to a one-tailed t-test for dependent samples, the before results (mean=4.9, sd=2.51) and the after results (mean=6.4, sd=1.84) are not significantly different from each other (t=-1.307, df=9, pone-tailed=0.1118).
4.5 Excursus
Note that the t-test for dependent samples A and B is actually a test for whether the pairwise differences of A and B have a mean of 0:
One Sample t-test
data: test_diffs
t = -1.3072, df = 9, p-value = 0.2235
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-4.095737 1.095737
sample estimates:
mean of x
-1.5
One Sample t-test
data: test_diffs
t = -1.3072, df = 9, p-value = 0.1118
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf 0.6034255
sample estimates:
mean of x
-1.5
Also, given the really small number of data points, it might still be advisable to also check with a Wilcoxon test:
Wilcoxon signed rank exact test
data: d$CR[d$TT == "before"] and d$CR[d$TT == "after"]
V = 17.5, p-value = 0.1836
alternative hypothesis: true location shift is less than 0
Ok, this confirms our result, excellent: According to a one-tailed Wilcoxon test for dependent samples, the before results (median=5.5, IQR=4.5) and the after results (median=6, IQR=2.5) are not significantly different from each other (V=13.5, pone-tailed=0.1421).
How might one test this using a simulation approach? Like this, and its one-tailed p-value is really not all that different from the one we got from the t-test for dependent samples:
set.seed(123) # set a random number generator
collector <- rep(NA, 2000) # set up a collector vector
for (i in 1:2000) { # do the following 2000 times
sampled_TT <- sample(d$TT) # reorder the testing times randomly
collector[i] <- diff( # store the difference between
tapply( # the two means resulting from
d$CR, # grouping these values
sampled_TT, # by these values
mean)) # & computing the mean for each group
}
hist(collector); abline(v=1.5, lty=2, col="red")[1] 0.0915
5 Exercise 05
What does this script exemplify?
rm(list=ls(all=TRUE)); set.seed(12) # clear all memory
x1 <- rnorm(50) ; y1 <- rnorm(50)
x2 <- rnorm(50)+5; y2 <- rnorm(50)+5
x <- append(x1, x2); y <- append(y1, y2)
plot(y ~ x, type="n"); abline(lm(y ~ x)); grid()
points(y1 ~ x1, col="#0000FF40", pch=19); abline(lm(y1 ~ x1), col="#0000FF40")
points(y2 ~ x2, col="#FF000040", pch=19); abline(lm(y2 ~ x2), col="#FF000040")
text(0, 6, "r for the\nblack line=0.88")It exemplifies that interpreting a correlation without visual inspection of the data is really dangerous: In a way that should remind you of slides 10-11 of session 1, if you just compute a correlation for all data points, you are lead to believe that there is a fairly strong positive correlation. A visual inspection (supported by the different colors of points and regression lines) immediately reveals that the data set actually appears to consist of two separate data sets, in each of which no correlation whatsoever can be found. The high r-value for the black regression line could only be taken somewhat seriously if the data looked like this, where the same correlation was found in both the blue and the red data:
rm(list=ls(all=TRUE)); set.seed(12) # clear all memory
x1 <- rnorm(50) ; y1 <- x1+rnorm(50, 0, 0.5)
x2 <- rnorm(50)+5; y2 <- x2+rnorm(50, 0, 0.5)
x <- append(x1, x2); y <- append(y1, y2)
plot(y ~ x, type="n"); abline(lm(y ~ x)); grid()
points(y1 ~ x1, col="#0000FF40", pch=19); abline(lm(y1 ~ x1), col="#0000FF80")
points(y2 ~ x2, col="#FF000040", pch=19); abline(lm(y2 ~ x2), col="#FF000080")
text(0, 6, "r for every\nline > 0.88")6 Exercise 06
Question: Deane (1987) posited that the choice of the of-genitive (e.g., the speech of the President) vs. the s-genitive (e.g., the President’s speech) was explainable with regard to the relevant referents’ position on an 11-point cognitive-entrenchment hierarchy (from 1 ‘lowest’ to 11 ‘highest’). Gries (1999) wanted to determine whether this entrenchment hierarchy would also explain the verb-particle construction alternation. He created examples of both constructions that used DOs from all 11 entrenchment levels (yielding 22 factor level combinations). Subjects were then given test sentences representing these factor level combinations and were instructed to rate the acceptability of the test sentences; 60 judgments were collected per factor level combination. The file in _input/entrenchment.csv contains the average judgments for each of the 22 combinations resulting from the experiment, see _input/entrenchment.r for the usual explanations. Based on Deane’s work, Gries had the hypotheses listed below; your job here is to test them and interpret the results.
6.1 Hypotheses
The
- dependent/response variable is
JUDGMENT; - independent/predictor variable is
ENTRENCHMENT.
but separately for each construction!
What are the hypotheses?
- text hypotheses for V_DO_PRT:
- H1: there is a positive correlation between the degree of entrenchment of the DO and the acceptability of the sentence;
- H0: there is no positive correlation between the degree of entrenchment of the DO and the acceptability of the sentence;
- statistical hypotheses for V_DO_PRT:
- H1: some correlation coefficient is >0;
- H0: some correlation coefficient is ≤0.
- text hypotheses for V_PRT_DO:
- H1: there is a negative correlation between the degree of entrenchment of the DO and the acceptability of the sentence;
- H0: there is no negative correlation between the degree of entrenchment of the DO and the acceptability of the sentence;
- statistical hypotheses for V_PRT_DO:
- H1: some correlation coefficient is <0;
- H0: some correlation coefficient is ≥0.
6.2 Descriptive stats/visualization
We load the data:
CASE CONSTRUCTION ENTRENCHMENT JUDGMENT
Min. : 1.00 v_do_prt:11 Min. : 1.00 Min. :26.20
1st Qu.: 6.25 v_prt_do:11 1st Qu.: 3.25 1st Qu.:59.32
Median :11.50 Median : 6.00 Median :72.88
Mean :11.50 Mean : 6.00 Mean :69.08
3rd Qu.:16.75 3rd Qu.: 8.75 3rd Qu.:84.56
Max. :22.00 Max. :11.00 Max. :92.23
We compute descriptive correlations:
Spearman's rho Kendall's tau
0.7727273 0.6000000
Spearman's rho Kendall's tau
-0.6000000 -0.4181818
This looks great! Let’s visualize the trend:
plot(d$JUDGMENT ~ d$ENTRENCHMENT,
xlab="Entrenchment" , xlim=c(1, 11),
ylab="Average acceptability", ylim=c(0, 100),
pch=substr(d$CONSTRUCTION, 3, 3), # pch = what's after the verb
col=rainbow(2)[d$CONSTRUCTION]); grid() # in two different colors; add grid
# add a regression line for v_do_prt
abline(model_d <- lm(
JUDGMENT ~ ENTRENCHMENT, data=d,
subset=CONSTRUCTION==levels(CONSTRUCTION)[1]),
col=rainbow(2)[1])
# add a regression line for v_do_prt
abline(model_p <- lm(
JUDGMENT ~ ENTRENCHMENT, data=d,
subset=CONSTRUCTION==levels(CONSTRUCTION)[2]),
col=rainbow(2)[2])
# add a legend
legend(x=4, xjust=0.5, y=20, yjust=0.5, # add a centered legend at these coordinates
legend=levels(d$CONSTRUCTION), # w/ these legend labels and
pch=substr(as.character(levels(d$CONSTRUCTION)), 3, 3), # these point characters &
col=rainbow(2), bty="n") # w/ these colors, no box6.3 Statistical testing
Let’s again use both ordinal correlation coefficients (Gries 1999 used Kendall’s τ only):
[[1]]
Spearman's rank correlation rho
data: ENTRENCHMENT and JUDGMENT
S = 50, p-value = 0.004031
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.7727273
[[2]]
Kendall's rank correlation tau
data: ENTRENCHMENT and JUDGMENT
T = 44, p-value = 0.004973
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.6
[[1]]
Spearman's rank correlation rho
data: ENTRENCHMENT and JUDGMENT
S = 352, p-value = 0.02809
alternative hypothesis: true rho is less than 0
sample estimates:
rho
-0.6
[[2]]
Kendall's rank correlation tau
data: ENTRENCHMENT and JUDGMENT
T = 16, p-value = 0.04328
alternative hypothesis: true tau is less than 0
sample estimates:
tau
-0.4181818
Everything’s significant in our one-tailed tests. And yet, if one pays attention, at least half of it is actually problematic. We can see that from fitting a smoother, which one should always do:
# this is all as before:
plot(d$JUDGMENT ~ d$ENTRENCHMENT,
xlab="Entrenchment", , xlim=c(1, 11),
ylab="Average acceptability", ylim=c(0, 100),
pch=substr(d$CONSTRUCTION, 3, 3), # pch = what's after the verb
col=rainbow(2)[d$CONSTRUCTION]); grid() # in two different colors; add grid
# add a legend
legend(x=4, xjust=0.5, y=20, yjust=0.5, # add a centered legend at these coordinates
legend=levels(d$CONSTRUCTION), # w/ these legend labels and
pch=substr(as.character(levels(d$CONSTRUCTION)), 3, 3), # these point characters &
col=rainbow(2), bty="n") # w/ these colors, no box
# now the new stuff: add the smoother
lines(lowess(d_split$v_do_prt$JUDGMENT ~ 1:11), col=rainbow(2)[1])
lines(lowess(d_split$v_prt_do$JUDGMENT ~ 1:11), col=rainbow(2)[2])The correlation for v_prt_do looks not that great anymore. In fact, the overall negative correlation is completely due to the highest three entrenchment levels:
with(d[d$ENTRENCHMENT<9,], {
plot(JUDGMENT ~ ENTRENCHMENT,
xlab="Entrenchment", , xlim=c(1, 11),
ylab="Average acceptability", ylim=c(0, 100),
pch=substr(CONSTRUCTION, 3, 3), # pch = what's after the verb
col=rainbow(2)[CONSTRUCTION]); grid() # in two different colors; add grid
# add a regression line for v_do_prt
abline(model_d <- lm(
JUDGMENT ~ ENTRENCHMENT,
subset=CONSTRUCTION==levels(CONSTRUCTION)[1]),
col=rainbow(2)[1])
# add a regression line for v_do_prt
abline(model_p <- lm(
JUDGMENT ~ ENTRENCHMENT,
subset=CONSTRUCTION==levels(CONSTRUCTION)[2]),
col=rainbow(2)[2])
# add a legend
legend(x=4, xjust=0.5, y=20, yjust=0.5, # add a centered legend at these coordinates
legend=levels(d$CONSTRUCTION), # w/ these legend labels and
pch=substr(as.character(levels(d$CONSTRUCTION)), 3, 3), # these point characters &
col=rainbow(2), bty="n") # w/ these colors, no box
})Now the correlations are much worse – the ones for v_do_prt mostly due to the reduction in sample size, but the ones for v_prt_do because now the only thing that’s responsible for it is gone …
[[1]]
Spearman's rank correlation rho
data: ENTRENCHMENT[1:8] and JUDGMENT[1:8]
S = 32, p-value = 0.05749
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6190476
[[2]]
Kendall's rank correlation tau
data: ENTRENCHMENT[1:8] and JUDGMENT[1:8]
T = 21, p-value = 0.05434
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.5
[[1]]
Spearman's rank correlation rho
data: ENTRENCHMENT[1:8] and JUDGMENT[1:8]
S = 86, p-value = 0.4884
alternative hypothesis: true rho is less than 0
sample estimates:
rho
-0.02380952
[[2]]
Kendall's rank correlation tau
data: ENTRENCHMENT[1:8] and JUDGMENT[1:8]
T = 14, p-value = 0.5476
alternative hypothesis: true tau is less than 0
sample estimates:
tau
0
The sensitivity of the regression line to what the three highest ENTRENCHMENT values do is shown in the animation here; note how the regression line only gets to have a negative slope when ENTRENCHMENT becomes 9 or higher:
plot(d$JUDGMENT[12:22] ~ d$ENTRENCHMENT[12:22],
type="n", lty=2, lwd=2,
xlim=c(1, 11), ylim=c(0, 100),
xlab="Entrenchment", ylab="Mean acceptability judgment"); grid()
points(d$JUDGMENT[12:13] ~ d$ENTRENCHMENT[12:13], type="p", pch="p", col=rainbow(2)[2])
for (i in 14:22) {
points(d$JUDGMENT[i] ~ d$ENTRENCHMENT[i], type="p", pch="p", col=rainbow(2)[2])
abline(lm((d$JUDGMENT[12:i] ~ d$ENTRENCHMENT[12:i])), col=rainbow(2)[2])
Sys.sleep(1)
}6.4 Write-up
Initial correlations were promising: [show plot and initial correlation coefficients]. However, closer inspection (visually, with smoothers and subsets of the data [explain in more detail]) indicate that, while the positive correlation for v_do_prt seems supported overall, the negative correlation for v_prt_do seems entirely due to a bifurcation in the data:
- when
ENTRENCHMENTis 8 or lower, there is no correlation withJUDGMENTand the sentences exhibit an overall high degree of acceptability (of ≈70); - when
ENTRENCHMENTis 9 or higher, there is no correlation withJUDGMENTand the sentences exhibit an overall low degree of acceptability (of ≈30).
Thus, it seems that one might consider the overall entrenchment hierarchy supported for v_do_prt, but not for v_prt_do, where, if anything, the 11-point entrenchment hierarchy collapses into a two-way distinction of 1st/2nd/3rd person pronouns (ENTRENCHMENT values in the interval [9, 11]) and everything else. While the acceptability differences of those two groups of ENTRENCHMENT values is still compatible with the overall thrust of the hypothesis, it seems to certainly weakens the overall utility of Deane’s hierarchy for particle placement. At least if the analysis wasn’t a load of crap in one other major way as well, namely …?
6.5 Excursus
Is the difference in the judgments between ENTRENCHMENT being 1:8 vs. being 9:11 significant? Because if it is, that could be at least a tiny bit of an effect as desired …
FALSE TRUE
29.76667 69.90000
A t-test for independent samples doesn’t seem like a good idea because one of the two groups of course has only 3 data points. Thus, we do something that I haven’t mentioned so far (and won’t mention again so, breathe, “this isn’t on the test”): permutation-based exact tests. They are somewhat ‘similar’ (can I hedge even more?) to the bootstrapping/simulation approaches, but this time we don’t draw randomly (with replacement) a certain number of times, this time we literally generate all possible combinations of the data exhaustively and count how many of all possible results are as or more extreme than ours. Step 1: all possible ways to take three values from the 11 for v_prt_do:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 5
[4,] 1 2 6
[5,] 1 2 7
[6,] 1 2 8
[7,] 1 2 9
[8,] 1 2 10
[9,] 1 2 11
[10,] 1 3 4
[11,] 1 3 5
[12,] 1 3 6
[13,] 1 3 7
[14,] 1 3 8
[15,] 1 3 9
[16,] 1 3 10
[17,] 1 3 11
[18,] 1 4 5
[19,] 1 4 6
[20,] 1 4 7
[21,] 1 4 8
[22,] 1 4 9
[23,] 1 4 10
[24,] 1 4 11
[25,] 1 5 6
[26,] 1 5 7
[27,] 1 5 8
[28,] 1 5 9
[29,] 1 5 10
[30,] 1 5 11
[31,] 1 6 7
[32,] 1 6 8
[33,] 1 6 9
[34,] 1 6 10
[35,] 1 6 11
[36,] 1 7 8
[37,] 1 7 9
[38,] 1 7 10
[39,] 1 7 11
[40,] 1 8 9
[41,] 1 8 10
[42,] 1 8 11
[43,] 1 9 10
[44,] 1 9 11
[45,] 1 10 11
[46,] 2 3 4
[47,] 2 3 5
[48,] 2 3 6
[49,] 2 3 7
[50,] 2 3 8
[51,] 2 3 9
[52,] 2 3 10
[53,] 2 3 11
[54,] 2 4 5
[55,] 2 4 6
[56,] 2 4 7
[57,] 2 4 8
[58,] 2 4 9
[59,] 2 4 10
[60,] 2 4 11
[61,] 2 5 6
[62,] 2 5 7
[63,] 2 5 8
[64,] 2 5 9
[65,] 2 5 10
[66,] 2 5 11
[67,] 2 6 7
[68,] 2 6 8
[69,] 2 6 9
[70,] 2 6 10
[71,] 2 6 11
[72,] 2 7 8
[73,] 2 7 9
[74,] 2 7 10
[75,] 2 7 11
[76,] 2 8 9
[77,] 2 8 10
[78,] 2 8 11
[79,] 2 9 10
[80,] 2 9 11
[81,] 2 10 11
[82,] 3 4 5
[83,] 3 4 6
[84,] 3 4 7
[85,] 3 4 8
[86,] 3 4 9
[87,] 3 4 10
[88,] 3 4 11
[89,] 3 5 6
[90,] 3 5 7
[91,] 3 5 8
[92,] 3 5 9
[93,] 3 5 10
[94,] 3 5 11
[95,] 3 6 7
[96,] 3 6 8
[97,] 3 6 9
[98,] 3 6 10
[99,] 3 6 11
[100,] 3 7 8
[101,] 3 7 9
[102,] 3 7 10
[103,] 3 7 11
[104,] 3 8 9
[105,] 3 8 10
[106,] 3 8 11
[107,] 3 9 10
[108,] 3 9 11
[109,] 3 10 11
[110,] 4 5 6
[111,] 4 5 7
[112,] 4 5 8
[113,] 4 5 9
[114,] 4 5 10
[115,] 4 5 11
[116,] 4 6 7
[117,] 4 6 8
[118,] 4 6 9
[119,] 4 6 10
[120,] 4 6 11
[121,] 4 7 8
[122,] 4 7 9
[123,] 4 7 10
[124,] 4 7 11
[125,] 4 8 9
[126,] 4 8 10
[127,] 4 8 11
[128,] 4 9 10
[129,] 4 9 11
[130,] 4 10 11
[131,] 5 6 7
[132,] 5 6 8
[133,] 5 6 9
[134,] 5 6 10
[135,] 5 6 11
[136,] 5 7 8
[137,] 5 7 9
[138,] 5 7 10
[139,] 5 7 11
[140,] 5 8 9
[141,] 5 8 10
[142,] 5 8 11
[143,] 5 9 10
[144,] 5 9 11
[145,] 5 10 11
[146,] 6 7 8
[147,] 6 7 9
[148,] 6 7 10
[149,] 6 7 11
[150,] 6 8 9
[151,] 6 8 10
[152,] 6 8 11
[153,] 6 9 10
[154,] 6 9 11
[155,] 6 10 11
[156,] 7 8 9
[157,] 7 8 10
[158,] 7 8 11
[159,] 7 9 10
[160,] 7 9 11
[161,] 7 10 11
[162,] 8 9 10
[163,] 8 9 11
[164,] 8 10 11
[165,] 9 10 11
Step 2: we take each row of all_ways_to_take_3 and use it to
- split the 11 values of
x_split$v_prt_do$JUDGMENTinto- the 3 values that are indexed in the row of
all_ways_to_take_3; - the remaining 8; and to then
- the 3 values that are indexed in the row of
- compute the difference between the means of
JUDGMENTof those two groups:
Step 3: we check how many of all the theoretically possible differences between those means is as small or smaller than ours and express that as a p-value:
Thus, according to an exact (permutation-based) t-test for independent samples, the averages of JUDGMENT for ENTRENCHMENT being 1 to 8 and for ENTRENCHMENT being 9 to 11 is significant (pone-tailed=0.006).
On my computer, there’s of course a function for that:
7 Session info
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 22.3
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.06 Rcpp_1.1.1-1.1 magrittr_2.0.5
loaded via a namespace (and not attached):
[1] digest_0.6.39 fastmap_1.2.0 xfun_0.57 nortest_1.0-4
[5] knitr_1.51 htmltools_0.5.9 rmarkdown_2.31 cli_3.6.6
[9] rstudioapi_0.18.0 tools_4.6.0 evaluate_1.0.5 yaml_2.3.12
[13] otel_0.2.0 htmlwidgets_1.6.4 rlang_1.2.0 jsonlite_2.0.0
[17] MASS_7.3-65