Ling 201, session 01: intro to statistical thinking

UC Santa Barbara

JLU Giessen

05 Jan 2025

Introduction

The scientific method

Figure 1: Scientific method

A few self-evident objectives of empirical scientific inquiry

  • Description, answering the question “what happens/happened?”
  • explanation, answering the question “why does x happen?”
  • prediction, answering the question “what will happen with x if …?”
  • control, answering the question “how can x be influenced?”

But why use statistics for this?

  • To describe, explain, and predict
    • objectively
    • precisely
    • comparably
    • concisely
  • to cope with variability and to generalize: different samples even from the same population will yield different results
  • thus, we need to be able to
    • quantify this variability
    • separate random from systematic/meaningful variability
  • to assess the robustness of one’s generalizations

Three absolutely central notions

  • Objectivity: independence of personal opinions
  • reliability: precision (in the sense of ‘re-test reliability’)
  • validity: one measures what one wants to measure; in a sense, this is probably the most important one

Pitfalls you can avoid with proper quantitative analysis

Two English verbs verb1 and verb2

  • A published study discussed the complementation preferences of verb1 and verb2 with regard to two grammatical patterns on the basis of the following data:
addmargins(example_1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2)))
     PATTERN
VERB    1   2 Sum
  1   295 104 399
  2   131  35 166
  Sum 426 139 565
round(prop.table(example_1, 1), 2)
    PATTERN
VERB    1    2
   1 0.74 0.26
   2 0.79 0.21
  • conclusion drawn from this: “[c]omparing the postverbal elements in the two verbs, we can see that the proportion of [pattern1] for [verb2] is higher than for [verb1]” …
  • yes, 79% > 74%, but a certain statistical test would have shown that the distribution is not significantly different from chance:

    Pearson's Chi-squared test

data:  example_1
X-squared = 1.5679, df = 1, p-value = 0.2105
  • thus, with this test, the author would have avoided making an incorrect overgeneralization.
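The test output above can be reproduced by hand; here is a minimal sketch (the expected frequencies follow directly from the row and column totals; the printed output matches `chisq.test(example_1, correct=FALSE)`, i.e. without Yates' continuity correction, which is evidently what was run above):

```r
# the published table: VERB (rows) by PATTERN (columns)
example_1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2))
# expected frequencies under H0: row total * column total / grand total
expected <- outer(rowSums(example_1), colSums(example_1)) / sum(example_1)
# the chi-squared statistic and its p-value (df = (2-1)*(2-1) = 1)
x2 <- sum((example_1 - expected)^2 / expected)
p  <- pchisq(x2, df=1, lower.tail=FALSE)
round(c(x2, p), 4)  # 1.5679 0.2105
```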

Two English verbs verb1 and verb2

  • Another study on two English verbs verb1 and verb2 discussed their complementation preferences with regard to 5 kinds of XPs on the basis of the following data:
addmargins(example_2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP"))))
     PATTERN
VERB   NP PP  VP AdjP AdvP Sum
  1   302  8 145   19    8 482
  2    73  0   5    3    0  81
  Sum 375  8 150   22    8 563
  • “we find that (a) [verb1] is more common before noun-phrases than before other constituents” …
  • yes, 302 is the largest figure in the first row, or even the whole table, but the focus of much of the study was on verb1 vs. verb2, and compared to verb2, verb1 actually disprefers to occur before NPs (as shown by the Pearson residuals):
    PATTERN
VERB    NP    PP    VP  AdjP  AdvP
   1 -1.06  0.44  1.46  0.04  0.44
   2  2.59 -1.07 -3.57 -0.09 -1.07
  • Thus, with this test, the author would have avoided their oversight.
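The residuals table above can be reproduced as follows; a sketch on the same counts (`chisq.test` warns about the small expected frequencies in this table, hence the `suppressWarnings`):

```r
example_2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP")))
# Pearson residuals: (observed - expected) / sqrt(expected); positive values
# mean 'more often than expected under H0', negative 'less often than expected'
resids <- suppressWarnings(chisq.test(example_2)$residuals)
round(resids, 2)
```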

Avoiding complete surprises 1

Figure 2: A correlation between two variables XX and YY

Avoiding complete surprises 2

Figure 3: A correlation between two variables XX and YY, controlled for a 3rd variable FF

Caveats: note, however

  • Statistics don’t provide content – the researcher does
  • statistics are only useful to the extent that the researcher has been successful in
    • operationalizing the variables appropriately
    • eliciting/collecting the data correctly
    • choosing the right statistical technique

The phases of empirical quantitative studies

The phases of an empirical study

  • reconnaissance, leading to variables
  • hypotheses (text and statistical forms)
  • data collection ((operationalizations of) variables)
  • evaluation of hypotheses given the data with
    • effect sizes
    • graphs
    • significance tests (p-values)

Phase 1 and 2: the notion of variables

  • Variables
    • are measurable properties or characteristics of an item
    • vary across different items, where “items” are the individual measurements of the ‘thing of interest’ (see below); items can be people, events (words, utterances, …)
  • non-linguistic examples:
    • annual income, number of children, IQ, …
    • party someone voted for in the last election, hair color, marital status, …
  • linguistic examples:
    • reaction time to a word, length of a word, …
    • animacy of a subject noun phrase: human (Peter) vs. animate (the cat) vs. inanimate concrete object (the table) vs. abstract (time), …
  • note: we can/need to decide on a resolution. For annual income:
    • numbers: the exact amount? the amount rounded to full US$?
    • ranked classes: ‘negative’, 0-30,000, 30,001-60,000, 60,001-100,000, 100,001 and above
    • categories: none vs. any? or below average vs. above average?

Phase 1 and 2: variable types, part 1

  • Variables can be distinguished in terms of their information value
    • categorical: different values → different properties
    • ordinal: categorical + different values → different ranks
    • numeric: categorical + ordinal + different values → sizes of differences
  • here are the results of a fictitious Olympic 100m dash – what is the information level of the variable in each column?
TIME RANK NAME NUMBER MEDAL
9.86 1 S. Davis 453473 1
9.91 2 J. White 563456 1
10.01 3 S. Hendry 756675 1
20.02 4 C. Lewis 585821 0
  • TIME: num, RANK: ord, NAME/NUMBER: cat, MEDAL: depends
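These information levels map onto different R column types; a minimal sketch (the encodings are one reasonable choice, not the only one):

```r
dash <- data.frame(
  TIME   = c(9.86, 9.91, 10.01, 20.02),     # numeric: differences have sizes
  RANK   = ordered(1:4),                     # ordinal: ranks, not distances
  NAME   = factor(c("S. Davis", "J. White", "S. Hendry", "C. Lewis")),  # categorical
  NUMBER = factor(c(453473, 563456, 756675, 585821)),  # categorical despite the digits
  MEDAL  = factor(c(1, 1, 1, 0)))            # depends: medal vs. none here
str(dash)
```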

Phase 1 and 2: variable types, part 2

  • Variables can be distinguished in terms of their role in an investigation
    • response/dependent: the variable whose values/behavior/variation we want to explain
    • predictor/independent: often the assumed cause for the behavior of the response

  • confounds (controlled, accounted for, or residualized out)
  • moderators (accounted for by interactions w/ add. variables)
  • colliders (accounted for differently)

Phase 1 and 2: variable types, practice

  • In the following non-linguistic examples of text hypotheses, what is the response, what is the predictor, and what are the variables’ information values?
    • people with a univ degree are more intelligent than people without one
    • response: IQ (num) ~ predictor: HASUNIVDEGREE (cat): no vs. yes
    • men are better at parking than women
    • response: PARKINGSKILL (?) ~ predictor: SEX/GENDER (cat): female vs. male

Phase 1 and 2: variable types, practice

  • In the following linguistic examples of text hypotheses, what is the response, what is the predictor, and what are the variables’ information values?
    • in essays, non-native speakers make more mistakes than native speakers
    • response: NUMBERofMISTAKES (num) ~ predictor: SPEAKERTYPE (cat): nns vs. ns
    • subjects are shorter than objects
    • response: LENGTH (num) ~ predictor: GRAMREL (cat): object vs. subject

Phase 2: What are hypotheses?

  • What are hypotheses? One definition:
    • universal statements (going beyond a singular event)
    • implicit structure of a conditional sentence
      • if [predictor] …, then [response] …
      • the more/less [predictor] …, the more/less [response] …
    • potentially falsifiable
    • empirically testable
  • maybe most useful definition: a statement postulating a distribution of one or more response variables
  • hypotheses come in different kinds

Phase 2: Kinds of hypotheses

  • text hypotheses vs. statistical hypotheses (→ operationalization)
  • alternative hypothesis H1: a statement postulating
    • a particular distribution of a (response) variable (goodness-of-fit)
    • a relation between 1+ predictors & 1+ response variables (independence/difference(s))
      • stipulating some difference, but not its direction: non-directional/2-tailed
      • e.g., subjects and objects differ in their lengths
      • stipulating a difference and its direction: directional/1-tailed
      • e.g., subjects are shorter than objects
  • null hypothesis H0:
    • the logical counterpart to H1: the alternative hypothesis with ‘not’ in it

Phase 2: Operationalization 1

  • Operationalization: the step from text hypotheses to statistical hypotheses
    • step 1: phrasing the variables in the text hypotheses such that they involve numbers
    • step 2: choose a statistical measure to be applied to those numbers
  • non-linguistic examples
    • parking performance
    • physical fitness
    • financial wealth
  • linguistic examples
    • the knowledge of a foreign language
    • the lengths of subjects and objects

Phase 2: Operationalization 2

  • Operationalization: the step from text hypotheses to statistical hypotheses
    • step 1: phrasing the variables in the text hypotheses such that they involve numbers
    • step 2: choose a statistical measure to be applied to those numbers
  • most frequent statistical measures:
    • counts/frequencies
    • averages/means
    • correlations
    • distributions and dispersions
  • what statistic to use for the lengths of subjects & objects?
    • summed total of the above (i.e. counts/frequencies)?
    • means of the above (i.e. averages/means)?

Phase 2: an example

  • Imagine you have the following alternative text hypothesis: “subjects are shorter than objects in English”
    • what’s the corresponding null hypothesis?
    • “subjects are not shorter than objects in English”
  • what variables are involved?
    • response: LENGTH (numeric) ~ predictor: GRAMREL (binary/categorical)
  • how to operationalize them?
    • LENGTH: let’s use length in words
    • GRAMREL:
      • object: the NP that is the ‘target’ of the action of a transitive verb and could become the subject of the sentence upon passivization
      • subject: the NP determining verbal morphology/agreement and that, prototypically, denotes the agent of the action denoted by the verb
  • what statistic to use?
    • average length of all objects vs. average length of all subjects

Phase 3: Data storage rules

  • Imagine you wanted to study this using corpus data
  • imagine you collected the following data set:
    • The younger bachelors ate the nice little cat
    • He was locking the door
    • The quick brown fox hit the lazy dog
  • rule: store the data in the so-called case-by-variable format:
    • each data point (i.e., measurement of the response variable) has a row on its own
    • every variable or every other characteristic of a data point has a column on its own
    • the very first row contains the names of all variables (header)
    • missing data are marked as NA – do not use empty cells!
    • do not use numbers for categorical variables

Phase 3: Data storage (wrong)

Table 1: Terribly wrong format
SENTENCE SUBJ OBJ
The younger bachelors ate the nice little cat 3 4
He was locking the door 1 2
The quick brown fox hit the lazy dog 4 3
  • remember: each data point should have a row on its own
  • remember: every variable should have a column on its own
  • how many data points? 6, but …
    • … each row has 2 data points (of LENGTH), not 1
  • how many variables? 2: LENGTH and GRAMREL, but …
    • each of the columns 2 and 3 represents the levels of a variable (GRAMREL), not a variable

Phase 3: Data storage (right)

  • Something like this would be the correct format:
Table 2: Correct format
CASE ITEM/SENTENCE LENGTH GRAMREL
1 The younger bachelors ate the nice little cat 3 subj
2 The younger bachelors ate the nice little cat 4 obj
3 He was locking the door 1 subj
4 He was locking the door 2 obj
5 The quick brown fox hit the lazy dog 4 subj
6 The quick brown fox hit the lazy dog 3 obj
  • how many variables? 2, and that’s the two main columns on the right
  • how many data points? 6, and that’s how many rows we have
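Table 2 translates directly into an R data frame, from which the statistic chosen above (average length per grammatical relation) falls out in one line; a sketch:

```r
d <- data.frame(
  CASE     = 1:6,
  SENTENCE = rep(c("The younger bachelors ate the nice little cat",
                   "He was locking the door",
                   "The quick brown fox hit the lazy dog"), each=2),
  LENGTH   = c(3, 4, 1, 2, 4, 3),
  GRAMREL  = rep(c("subj", "obj"), times=3))
# the statistic from the operationalization: average length per GRAMREL
tapply(d$LENGTH, d$GRAMREL, mean)  # obj = 3, subj ≈ 2.67
```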

Phase 3: Data storage: direct comparison

The logic of hypothesis testing

The scientific method

  • The logic of statistical testing is that of hypothesis falsification:
    • one does not prove that one’s own H1 is correct
    • one ‘proves’ that the H0 is wrong, which means one’s H1 is right
  • steps:
    • one defines a significance level pcritical, which quantifies how quickly one will reject H0 / accept H1
    • one computes the effect e observed in one’s data (using the statistic from the statistical hypothesis)
    • one computes the probability of error p how likely it is to find e if H0 is correct
    • decision
      • if p<pcritical, one rejects H0 and accepts H1
      • if p≥pcritical, one must stick to H0 and cannot accept H1

Coin tossing, scenario 1

  • You and I play a game, tossing a coin 100 times: heads: $1 for me; tails: $1 for you
  • your hypotheses:
    • H0: both players are honest: p(heads)=p(tails)=0.5
    • H1: STG is not honest: p(heads)>0.5 and p(tails)<0.5
  • the significance level is set to 0.05
  • now, how often do you have to lose before you begin to accuse me of cheating a.k.a. accepting H1?
    • when you lose 51 times?
    • when you lose 55 times?
    • when you lose 59 times?
  • what are you doing? You’re looking at an effect e (the result STG: 3 vs. you: 0, i.e. your losses) and are determining when e becomes too unlikely to still believe in H0

Tossing a coin just 3 times

  • You set the significance level: pcritical=0.05
  • we play, you lose 3 times out of 3: the effect e is 3:0
Toss 1 Toss 2 Toss 3 Heads Tails presult
heads heads heads 3 0 0.125
heads heads tails 2 1 0.125
heads tails heads 2 1 0.125
heads tails tails 1 2 0.125
tails heads heads 2 1 0.125
tails heads tails 1 2 0.125
tails tails heads 1 2 0.125
tails tails tails 0 3 0.125
  • probability of error p=0.125
  • decision: p>pcritical: you must stick to H0
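The table's p-value, and the answer to the 100-toss question in scenario 1, both come straight from the binomial distribution; a sketch:

```r
# 3 fair tosses: P(3 heads) is the one-tailed p for the 3:0 result
dbinom(3, size=3, prob=0.5)  # 0.125
# 100 fair tosses: one-tailed p for losing at least 51, 55, or 59 times
round(pbinom(c(51, 55, 59) - 1, size=100, prob=0.5, lower.tail=FALSE), 4)
# only 59 losses yields p < 0.05 (p ≈ 0.044)
```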

Tossing a coin more often

Coin tossing, scenario 2

  • You and I play a game, tossing a coin 100 times: heads: $1 for me; tails: $1 for you
  • an independent observer’s hypotheses:
    • H0: both players are honest: p(heads)=p(tails)=0.5
    • H1: at least one player is not honest: p(heads)>0.5 or p(heads)<0.5
  • the significance level is set to 0.05
  • now, how often does one of us have to lose before the independent observer begins to accuse the other of cheating a.k.a. accepting H1?
    • when someone loses 51 times?
    • when someone loses 56 times?
    • when someone loses 61 times?
  • what is the independent observer doing? Looking at an effect e (the results someone: 3 vs. someone else: 0, i.e. someone’s losses) and determining when e becomes too unlikely to still believe in H0

Tossing a coin just 3 times

  • An independent observer sets the significance level: pcritical=0.05
  • we play, one of us (you) loses 3 times out of 3: the effect e is 3:0
Toss 1 Toss 2 Toss 3 Heads Tails presult
heads heads heads 3 0 0.125
heads heads tails 2 1 0.125
heads tails heads 2 1 0.125
heads tails tails 1 2 0.125
tails heads heads 2 1 0.125
tails heads tails 1 2 0.125
tails tails heads 1 2 0.125
tails tails tails 0 3 0.125
  • probability of error p=0.125 (from 3:0) + 0.125 (from 0:3) = 0.25
  • decision: p>pcritical: the observer must stick to H0
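The only computational change relative to scenario 1 is that both extremes now count; a sketch:

```r
# two-tailed: a 3:0 result is 'as extreme' no matter who wins everything
p_one_tailed <- dbinom(3, size=3, prob=0.5)       # 0.125
p_two_tailed <- p_one_tailed + dbinom(0, 3, 0.5)  # 0.25: both tails count
# 100 tosses: two-tailed p for someone losing at least 51, 56, or 61 times
round(2 * pbinom(c(51, 56, 61) - 1, size=100, prob=0.5, lower.tail=FALSE), 4)
# only 61 losses yields p < 0.05
```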

Tossing a coin more often

Lessons to be learned, part 1

  • Lesson 1 is about distributions and parametric testing:
  • in this case of binomial trials, with increasing sample sizes,
    • we obtain a bell-shaped normal distribution
    • even if the ‘input probability’ is not normal
  • thus, if the sample sizes are large enough and the distribution looks like one we can describe easily, then …
  • we can use a parametric/asymptotic test – but only then!
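A quick simulation illustrates Lesson 1 (a sketch; the specific sample sizes and seed are arbitrary choices):

```r
set.seed(1)
# counts of heads in 10,000 simulated runs of n fair tosses each
heads_3   <- rbinom(10000, size=3,   prob=0.5)  # coarse: only 4 possible values
heads_100 <- rbinom(10000, size=100, prob=0.5)  # approximately bell-shaped
# with n=100, the counts cluster symmetrically around n*p = 50,
# with a standard deviation close to sqrt(n*p*(1-p)) = 5
c(mean(heads_100), sd(heads_100))
# hist(heads_100) would show the familiar bell shape
```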

Lessons to be learned, part 2

  • Lesson 2 is about alternative hypotheses: there are
    • directional/one-tailed alternative hypotheses:
      • postulate an effect, a difference, a correlation,
      • and its direction (above, you)
    • non-directional/two-tailed alternative hypotheses:
      • postulate an effect, a difference, a correlation,
      • but not its direction (above: the independent observer)
  • prior knowledge is rewarded: the former is easier to accept (above: p=0.125 vs. 0.25)
  • but where are those p-values coming from?

Phase 4, evaluation and interpretation

Choosing a method/test, part 1

  • What kind of study is being conducted
    • descriptive, exploratory, hypothesis-generating
    • hypothesis-testing
  • how many and what kinds of variables are involved?
    • 1 response (goodness-of-fit tests)
    • 1 response & 1 predictor (monofactorial test for independence or differences)
    • 1 response & 2+ predictors (multifactorial analyses)
    • 2 responses (multivariate analyses)
  • are data points related such that you can associate them with each other in a meaningful principled way?
    • no: tests for independent samples
    • yes: tests for dependent samples
    • the latter are usually more powerful

Choosing a method/test, part 2

  • What is the statistic of the dependent variable in the statistical hypothesis?
    • counts/frequencies → often chi-squared tests
    • distributions → often Kolmogorov-Smirnov tests
    • averages/means → often t-tests
    • dispersions → often F-tests
    • correlations → often r or ρ or τ
  • What does the distribution of the data look like?
    • normal: often leads to parametric tests
    • non-normal: often leads to non-parametric, simulation, or exact tests
  • how big are the samples to be collected?
    • <30: often a risk to normality assumptions
    • ≥30: often supporting normality assumptions

Significance testing (again)

  • Your results section should usually include
    • the observed effect e
    • some significance results from some test(s)
    • how both these aspects of your results relate to your hypotheses
  • but again: the p-value indicates how likely the observed result is given the H0 – nothing else
Recall that the standard p-value required in the humanities and social sciences
is 0.05. [...] What does this statistical significance mean? It means that
there is at least a 95% chance that the null hypothesis is *incorrect*.
  • this is completely wrong:
    • this author: p is p(H0 = FALSE | data)
    • actually: p is p(data | H0 = TRUE)
  • often, people distinguish ‘levels of significance’:
    • p<0.001 (highly significant) vs. 0.01>p≥0.001 (very significant) vs. 0.05>p≥0.01 (significant)
    • 0.1>p≥0.05: marginally significant – stupid, don’t use this

Effect sizes

  • As mentioned above, your results should also include effect sizes
  • effect sizes are correlated with p-values, but not deterministically so: often
    • strong effects will be significant, and
    • weak effects will be insignificant
  • but,
    • given large sample sizes, even very weak effects can be significant
    • given large variability, even strong effects can be insignificant
Learner  of-gen  s-gen  Sum
Chinese      20     15   35
German       15     20   35
Sum          35     35   70

 p-value  odds ratio
  0.2320      1.7778

Learner  of-gen  s-gen  Sum
Chinese     200    150  350
German      150    200  350
Sum         350    350  700

 p-value  odds ratio
  0.0002      1.7778
  • you must keep significance and effect size separate in your head
    • significance: how likely is the effect when ‘in reality, there’s nothing’?
    • effect size: how big/strong is the effect regardless of whether it’s random?
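The two tables above differ only in sample size; a sketch reproducing the p-values and the (identical) odds ratios, assuming a chi-squared test without continuity correction (which matches the printed p-values):

```r
small <- matrix(c(20, 15, 15, 20), ncol=2, byrow=TRUE,
   dimnames=list(Learner=c("Chinese", "German"), GENITIVE=c("of-gen", "s-gen")))
large <- small * 10
# sample odds ratio (a/b)/(c/d): the effect size is identical in both tables
odds_ratio <- function(m) (m[1,1]/m[1,2]) / (m[2,1]/m[2,2])
round(c(odds_ratio(small), odds_ratio(large)), 4)  # 1.7778 1.7778
# but the p-values differ by three orders of magnitude
c(chisq.test(small, correct=FALSE)$p.value,   # ≈ 0.2320
  chisq.test(large, correct=FALSE)$p.value)   # ≈ 0.0002
```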

Approaching significance with simulation

For 20 nouns, you measured …

  • a predictor IMAGEABILITY: whether one can imagine/visualize the referent of the noun (n: ‘no’ vs. y: ‘yes’)
  • a response RT: a reaction time score ranging from 1 (fastest) to 20 (slowest)
  • check out this nearly perfect correlation:
Figure 4: The correlation between RT and IMAGEABILITY

How do we determine whether that effect e is significant?

  • The observed effect e (mean RT for n minus mean RT for y) is 14 minus 7 = 7, but H0 hypothesizes an effect of 0
  • how about we generate relevant H0 data and check how the observed effect e compares to those H0 data?
  • relevant H0 data
    • have the same IMAGEABILITY frequencies of n and y (10 each), &
    • have the same values of RT, but
    • are somehow random and, thus, compatible with H0 – how?
  • simple: we destroy the association of RT ~ IMAGEABILITY (n & slow / y & fast) by randomly reordering the values of the predictor IMAGEABILITY!
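The whole recipe can be run as code; here is a sketch with made-up RT values (the slide's actual 20 measurements are not reproduced here, so the resulting p-value differs from the one reported below):

```r
set.seed(1)
# hypothetical data: 10 imageable (fast) and 10 non-imageable (slow) nouns
d <- data.frame(RT           = 1:20,
                IMAGEABILITY = rep(c("y", "n"), each=10))
# observed effect e: mean RT of the 'n' nouns minus mean RT of the 'y' nouns
obs_e <- mean(d$RT[d$IMAGEABILITY=="n"]) - mean(d$RT[d$IMAGEABILITY=="y"])
# H0 effects: shuffle the predictor, destroying the RT ~ IMAGEABILITY association
perm_e <- replicate(10000, {
  shuffled <- sample(d$IMAGEABILITY)
  mean(d$RT[shuffled=="n"]) - mean(d$RT[shuffled=="y"])
})
# p: the proportion of H0 effects at least as large as the observed one
mean(perm_e >= obs_e)
```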

RT ~ IMAGEABILITY (randomized 1)

set.seed(1); d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 5: A correlation between RT and randomized IMAGEABILITY 1

RT ~ IMAGEABILITY (randomized 2)

d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 6: A correlation between RT and randomized IMAGEABILITY 2

RT ~ IMAGEABILITY (randomized 3)

d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 7: A correlation between RT and randomized IMAGEABILITY 3

We need this much more often …

  • Let’s generate not 3, but 100,000 random H0 distributions, i.e. 100,000 effects e, …
Figure 8: The first 10 of 100,000 random distributions
  • … which should average out to 0

But how do we evaluate this?

Figure 9: A histogram of the H0 correlations between RT and IMAGEABILITY
  • We can represent all the H0 effects e1-100,000, e.g. in a histogram
  • we can add a vertical line to the histogram that represents the actual observed effect e of 7
  • we can count how often we get a value of 7 or higher in the H0 data and …
  • … express that as a percentage – that is p
  • here, p is 0.0034 – the observed difference of 7 between non-imageable and imageable nouns is significant(ly different from 0)

How well does this work?

  • Reminder: the p-value from the simulation is 0.0034
  • the ‘gold standard’ p-value from an exact (!) t-test for independent samples is 0.00342, …
  • … which means that the simulation approach scores a nearly perfect result
  • what about the parametric t-test (according to Welch)? Its p-value is 0.00232, which is also pretty close (but worse than the simulation!)
  • what about the parametric t-test (according to Student)? Its p-value is 0.00227, which is also close (but worse than the simulation!)
  • simulation-based approaches are very versatile and useful – they can often help when few other things can!