Ling 201, session 01: intro to statistical thinking

UC Santa Barbara

JLU Giessen

05 Jan 2025

Introduction

The scientific method

Figure 1: Scientific method

A few self-evident objectives of empirical scientific inquiry

  • Description, answering the question “what happens/happened?”
  • explanation, answering the question “why does x happen?”
  • prediction, answering the question “what will happen with x if …?”
  • control, answering the question “how can x be influenced?”

But why use statistics for this?

  • To describe, explain, and predict
    • objectively
    • precisely
    • comparably
    • concisely
  • to cope with variability and to generalize: different samples even from the same population will yield different results
  • thus, we need to be able to
    • quantify this variability
    • separate random from systematic/meaningful variability
  • to assess the robustness of one’s generalizations

Three absolutely central notions

  • Objectivity: independence of personal opinions
  • reliability: precision (in the sense of ‘re-test reliability’)
  • validity: one measures what one wants to measure; in a sense, this is probably the most important one

Pitfalls you can avoid with proper quantitative analysis

Two English verbs verb1 and verb2

  • A published study discussed the complementation preferences of verb1 and verb2 with regard to two grammatical patterns on the basis of the following data:
addmargins(example_1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2)))
     PATTERN
VERB    1   2 Sum
  1   295 104 399
  2   131  35 166
  Sum 426 139 565
round(prop.table(example_1, 1), 2)
    PATTERN
VERB    1    2
   1 0.74 0.26
   2 0.79 0.21
  • conclusion drawn from this: “[c]omparing the postverbal elements in the two verbs, we can see that the proportion of [pattern1] for [verb2] is higher than for [verb1]” …
  • yes, 79% > 74%, but a certain statistical test would have shown that the distribution is not significantly different from chance:

    Pearson's Chi-squared test

data:  example_1
X-squared = 1.5679, df = 1, p-value = 0.2105
  • thus, with this test, the author would have avoided making an incorrect overgeneralization.
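The test output above can be reproduced by hand; here is a minimal sketch (the expected frequencies follow directly from the row and column totals; the printed output matches `chisq.test(example_1, correct=FALSE)`, i.e. without Yates' continuity correction, which is evidently what was run above):

```r
# the published table: VERB (rows) by PATTERN (columns)
example_1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2))
# expected frequencies under H0: row total * column total / grand total
expected <- outer(rowSums(example_1), colSums(example_1)) / sum(example_1)
# the chi-squared statistic and its p-value (df = (2-1)*(2-1) = 1)
x2 <- sum((example_1 - expected)^2 / expected)
p  <- pchisq(x2, df=1, lower.tail=FALSE)
round(c(x2, p), 4)  # 1.5679 0.2105
```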

Two English verbs verb1 and verb2

  • Another study on two English verbs verb1 and verb2 discussed their complementation preferences with regard to 5 kinds of XPs on the basis of the following data:
addmargins(example_2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP"))))
     PATTERN
VERB   NP PP  VP AdjP AdvP Sum
  1   302  8 145   19    8 482
  2    73  0   5    3    0  81
  Sum 375  8 150   22    8 563
  • “we find that (a) [verb1] is more common before noun-phrases than before other constituents” …
  • yes, 302 is the largest figure in the first row, or even the whole table, but the focus of much of the study was on verb1 vs. verb2, and compared to verb2, verb1 actually disprefers to occur before NPs (as shown by the Pearson residuals):
    PATTERN
VERB    NP    PP    VP  AdjP  AdvP
   1 -1.06  0.44  1.46  0.04  0.44
   2  2.59 -1.07 -3.57 -0.09 -1.07
  • Thus, with this test, the author would have avoided their oversight.
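The residuals table above can be reproduced as follows; a sketch on the same counts (`chisq.test` warns about the small expected frequencies in this table, hence the `suppressWarnings`):

```r
example_2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP")))
# Pearson residuals: (observed - expected) / sqrt(expected); positive values
# mean 'more often than expected under H0', negative 'less often than expected'
resids <- suppressWarnings(chisq.test(example_2)$residuals)
round(resids, 2)
```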

Avoiding complete surprises 1

Figure 2: A correlation between two variables XX and YY

Avoiding complete surprises 2

Figure 3: A correlation between two variables XX and YY, controlled for a 3rd variable FF

Caveats: note, however

  • Statistics don’t provide content – the researcher does
  • statistics are only useful to the extent that the researcher has been successful in
    • operationalizing the variables appropriately
    • eliciting/collecting the data correctly
    • choosing the right statistical technique

The phases of empirical quantitative studies

The phases of an empirical study

  • reconnaissance, leading to variables
  • hypotheses (text and statistical forms)
  • data collection ((operationalizations of) variables)
  • evaluation of hypotheses given the data with
    • effect sizes
    • graphs
    • significance tests (p-values)

Phase 1 and 2: the notion of variables

  • Variables
    • are measurable properties or characteristics of an item
    • vary across different items, where “items” are the individual measurements of the ‘thing of interest’ (see below); items can be people, events (words, utterances, …)
  • non-linguistic examples:
    • annual income, number of children, IQ, …
    • party someone voted for in the last election, hair color, marital status, …
  • linguistic examples:
    • reaction time to a word, length of a word, …
    • animacy of a subject noun phrase: human (Peter) vs. animate (the cat) vs. inanimate concrete object (the table) vs. abstract (time), …
  • note: we can/need to decide on a resolution. For annual income:
    • numbers: the exact amount? the amount rounded to full US$?
    • ranked classes: ‘negative’, 0-30,000, 30,001-60,000, 60,001-100,000, 100,001 and above
    • categories: none vs. any? or below average vs. above average?

Phase 1 and 2: variable types, part 1

  • Variables can be distinguished in terms of their information value
    • categorical: different values → different properties
    • ordinal: categorical + different values → different ranks
    • numeric: categorical + ordinal + different values → sizes of differences
  • here are the results of a fictitious Olympic 100m dash – what is the information level of the variable in each column?
TIME RANK NAME NUMBER MEDAL
9.86 1 S. Davis 453473 1
9.91 2 J. White 563456 1
10.01 3 S. Hendry 756675 1
20.02 4 C. Lewis 585821 0
  • TIME: num, RANK: ord, NAME/NUMBER: cat, MEDAL: depends
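These information levels map onto different R column types; a minimal sketch (the encodings are one reasonable choice, not the only one):

```r
dash <- data.frame(
  TIME   = c(9.86, 9.91, 10.01, 20.02),     # numeric: differences have sizes
  RANK   = ordered(1:4),                     # ordinal: ranks, not distances
  NAME   = factor(c("S. Davis", "J. White", "S. Hendry", "C. Lewis")),  # categorical
  NUMBER = factor(c(453473, 563456, 756675, 585821)),  # categorical despite the digits
  MEDAL  = factor(c(1, 1, 1, 0)))            # depends: medal vs. none here
str(dash)
```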

Phase 1 and 2: variable types, part 2

  • Variables can be distinguished in terms of their role in an investigation
    • response/dependent: the variable whose values/behavior/variation we want to explain
    • predictor/independent: often the assumed cause for the behavior of the response

  • confounds (controlled, accounted for, or residualized out)
  • moderators (accounted for by interactions w/ add. variables)
  • colliders (accounted for differently)

Phase 1 and 2: variable types, practice

  • In the following non-linguistic examples of text hypotheses, what is the response, what is the predictor, and what are the variables’ information values?
    • people with a univ degree are more intelligent than people without one
    • response: IQ (num) ~ predictor: HASUNIVDEGREE (cat): no vs. yes
    • men are better at parking than women
    • response: PARKINGSKILL (?) ~ predictor: SEX/GENDER (cat): female vs. male

Phase 1 and 2: variable types, practice

  • In the following linguistic examples of text hypotheses, what is the response, what is the predictor, and what are the variables’ information values?
    • in essays, non-native speakers make more mistakes than native speakers
    • response: NUMBERofMISTAKES (num) ~ predictor: SPEAKERTYPE (cat): nns vs. ns
    • subjects are shorter than objects
    • response: LENGTH (num) ~ predictor: GRAMREL (cat): object vs. subject

Phase 2: What are hypotheses?

  • What are hypotheses? One definition:
    • universal statements (going beyond a singular event)
    • implicit structure of a conditional sentence
      • if [predictor] …, then [response] …
      • the more/less [predictor] …, the more/less [response] …
    • potentially falsifiable
    • empirically testable
  • maybe most useful definition: a statement postulating a distribution of one or more response variables
  • hypotheses come in different kinds

Phase 2: Kinds of hypotheses

  • text hypotheses vs. statistical hypotheses (→ operationalization)
  • alternative hypothesis H1: a statement postulating
    • a particular distribution of a (response) variable (goodness-of-fit)
    • a relation between 1+ predictors & 1+ response variables (independence/difference(s))
      • stipulating some difference, but not its direction: non-directional/2-tailed
      • e.g., subjects and objects differ in their lengths
      • stipulating a difference and its direction: directional/1-tailed
      • e.g., subjects are shorter than objects
  • null hypothesis H0:
    • the logical counterpart to H1: the alternative hypothesis with ‘not’ in it

Phase 2: Operationalization 1

  • Operationalization: the step from text hypotheses to statistical hypotheses
    • step 1: phrasing the variables in the text hypotheses such that they involve numbers
    • step 2: choose a statistical measure to be applied to those numbers
  • non-linguistic examples
    • parking performance
    • physical fitness
    • financial wealth
  • linguistic examples
    • the knowledge of a foreign language
    • the lengths of subjects and objects

Phase 2: Operationalization 2

  • Operationalization: the step from text hypotheses to statistical hypotheses
    • step 1: phrasing the variables in the text hypotheses such that they involve numbers
    • step 2: choose a statistical measure to be applied to those numbers
  • most frequent statistical measures:
    • counts/frequencies
    • averages/means
    • correlations
    • distributions and dispersions
  • what statistic to use for the lengths of subjects & objects?
    • summed total of the above (i.e. counts/frequencies)?
    • means of the above (i.e. averages/means)?

Phase 2: an example

  • Imagine you have the following alternative text hypothesis: “subjects are shorter than objects in English”
    • what’s the corresponding null hypothesis?
    • “subjects are not shorter than objects in English”
  • what variables are involved?
    • response: LENGTH (numeric) ~ predictor: GRAMREL (binary/categorical)
  • how to operationalize them?
    • LENGTH: let’s use length in words
    • GRAMREL:
      • object: the NP that is the ‘target’ of the action of a transitive verb and could become the subject of the sentence upon passivization
      • subject: the NP determining verbal morphology/agreement and that, prototypically, denotes the agent of the action denoted by the verb
  • what statistic to use?
    • average length of all objects vs. average length of all subjects

Phase 3: Data storage rules

  • Imagine you wanted to study this using corpus data
  • imagine you collected the following data set:
    • The younger bachelors ate the nice little cat
    • He was locking the door
    • The quick brown fox hit the lazy dog
  • rule: store the data in the so-called case-by-variable format:
    • each data point (i.e., measurement of the response variable) has a row on its own
    • every variable or every other characteristic of a data point has a column on its own
    • the very first row contains the names of all variables (header)
    • missing data are marked as NA – do not use empty cells!
    • do not use numbers for categorical variables

Phase 3: Data storage (wrong)

Table 1: Terribly wrong format
SENTENCE SUBJ OBJ
The younger bachelors ate the nice little cat 3 4
He was locking the door 1 2
The quick brown fox hit the lazy dog 4 3
  • remember: each data point should have a row on its own
  • remember: every variable should have a column on its own
  • how many data points? 6, but …
    • … each row has 2 data points (of LENGTH), not 1
  • how many variables? 2: LENGTH and GRAMREL, but …
    • each of the columns 2 and 3 represents the levels of a variable (GRAMREL), not a variable

Phase 3: Data storage (right)

  • Something like this would be the correct format:
Table 2: Correct format
CASE ITEM/SENTENCE LENGTH GRAMREL
1 The younger bachelors ate the nice little cat 3 subj
2 The younger bachelors ate the nice little cat 4 obj
3 He was locking the door 1 subj
4 He was locking the door 2 obj
5 The quick brown fox hit the lazy dog 4 subj
6 The quick brown fox hit the lazy dog 3 obj
  • how many variables? 2, and that’s the two main columns on the right
  • how many data points? 6, and that’s how many rows we have
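Table 2 translates directly into an R data frame, from which the statistic chosen above (average length per grammatical relation) falls out in one line; a sketch:

```r
d <- data.frame(
  CASE     = 1:6,
  SENTENCE = rep(c("The younger bachelors ate the nice little cat",
                   "He was locking the door",
                   "The quick brown fox hit the lazy dog"), each=2),
  LENGTH   = c(3, 4, 1, 2, 4, 3),
  GRAMREL  = rep(c("subj", "obj"), times=3))
# the statistic from the operationalization: average length per GRAMREL
tapply(d$LENGTH, d$GRAMREL, mean)  # obj = 3, subj ≈ 2.67
```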

Phase 3: Data storage: direct comparison

The logic of hypothesis testing

The scientific method

  • The logic of statistical testing is that of hypothesis falsification:
    • one does not prove that one’s own H1 is correct
    • one ‘proves’ that the H0 is wrong, which means one’s H1 is right
  • steps:
    • one defines a significance level pcritical, which quantifies how quickly one will reject H0 / accept H1
    • one computes the effect e observed in one’s data (using the statistic from the statistical hypothesis)
    • one computes the probability of error p how likely it is to find e if H0 is correct
    • decision
      • if p<pcritical, one rejects H0 and accepts H1
      • if p≥pcritical, one must stick to H0 and cannot accept H1

Coin tossing, scenario 1

  • You and I play a game, tossing a coin 100 times: heads: $1 for me; tails: $1 for you
  • your hypotheses:
    • H0: both players are honest: p(heads)=p(tails)=0.5
    • H1: STG is not honest: p(heads)>0.5 and p(tails)<0.5
  • the significance level is set to 0.05
  • now, how often do you have to lose before you begin to accuse me of cheating a.k.a. accepting H1?
    • when you lose 51 times?
    • when you lose 55 times?
    • when you lose 59 times?
  • what are you doing? You’re looking at an effect e (the result STG: 3 vs. you: 0, i.e. your losses) and are determining when e becomes too unlikely to still believe in H0

Tossing a coin just 3 times

  • You set the significance level: pcritical=0.05
  • we play, you lose 3 times out of 3: the effect e is 3:0
Toss 1 Toss 2 Toss 3 Heads Tails presult
heads heads heads 3 0 0.125
heads heads tails 2 1 0.125
heads tails heads 2 1 0.125
heads tails tails 1 2 0.125
tails heads heads 2 1 0.125
tails heads tails 1 2 0.125
tails tails heads 1 2 0.125
tails tails tails 0 3 0.125
  • probability of error p=0.125
  • decision: p>pcritical: you must stick to H0
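The table's p-value, and the answer to the 100-toss question in scenario 1, both come straight from the binomial distribution; a sketch:

```r
# 3 fair tosses: P(3 heads) is the one-tailed p for the 3:0 result
dbinom(3, size=3, prob=0.5)  # 0.125
# 100 fair tosses: one-tailed p for losing at least 51, 55, or 59 times
round(pbinom(c(51, 55, 59) - 1, size=100, prob=0.5, lower.tail=FALSE), 4)
# only 59 losses yields p < 0.05 (p ≈ 0.044)
```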

Tossing a coin more often

Coin tossing, scenario 2

  • You and I play a game, tossing a coin 100 times: heads: $1 for me; tails: $1 for you
  • an independent observer’s hypotheses:
    • H0: both players are honest: p(heads)=p(tails)=0.5
    • H1: at least one player is not honest: p(heads)>0.5 or p(heads)<0.5
  • the significance level is set to 0.05
  • now, how often does one of us have to lose before the independent observer begins to accuse the other of cheating a.k.a. accepting H1?
    • when someone loses 51 times?
    • when someone loses 56 times?
    • when someone loses 61 times?
  • what is the independent observer doing? Looking at an effect e (the results someone: 3 vs. someone else: 0, i.e. someone’s losses) and determining when e becomes too unlikely to still believe in H0

Tossing a coin just 3 times

  • An independent observer sets the significance level: pcritical=0.05
  • we play, one of us (you) loses 3 times out of 3: the effect e is 3:0
Toss 1 Toss 2 Toss 3 Heads Tails presult
heads heads heads 3 0 0.125
heads heads tails 2 1 0.125
heads tails heads 2 1 0.125
heads tails tails 1 2 0.125
tails heads heads 2 1 0.125
tails heads tails 1 2 0.125
tails tails heads 1 2 0.125
tails tails tails 0 3 0.125
  • probability of error p=0.125 (from 3:0) + 0.125 (from 0:3) = 0.25
  • decision: p>pcritical: the observer must stick to H0
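The only computational change relative to scenario 1 is that both extremes now count; a sketch:

```r
# two-tailed: a 3:0 result is 'as extreme' no matter who wins everything
p_one_tailed <- dbinom(3, size=3, prob=0.5)       # 0.125
p_two_tailed <- p_one_tailed + dbinom(0, 3, 0.5)  # 0.25: both tails count
# 100 tosses: two-tailed p for someone losing at least 51, 56, or 61 times
round(2 * pbinom(c(51, 56, 61) - 1, size=100, prob=0.5, lower.tail=FALSE), 4)
# only 61 losses yields p < 0.05
```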

Tossing a coin more often

Lessons to be learned, part 1

  • Lesson 1 is about distributions and parametric testing:
  • in this case of binomial trials, with increasing sample sizes,
    • we obtain a bell-shaped normal distribution
    • even if the ‘input probability’ is not normal
  • thus, if the sample sizes are large enough and the distribution looks like one we can describe easily, then …
  • we can use a parametric/asymptotic test – but only then!
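A quick simulation illustrates Lesson 1 (a sketch; the specific sample sizes and seed are arbitrary choices):

```r
set.seed(1)
# counts of heads in 10,000 simulated runs of n fair tosses each
heads_3   <- rbinom(10000, size=3,   prob=0.5)  # coarse: only 4 possible values
heads_100 <- rbinom(10000, size=100, prob=0.5)  # approximately bell-shaped
# with n=100, the counts cluster symmetrically around n*p = 50,
# with a standard deviation close to sqrt(n*p*(1-p)) = 5
c(mean(heads_100), sd(heads_100))
# hist(heads_100) would show the familiar bell shape
```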

Lessons to be learned, part 2

  • Lesson 2 is about alternative hypotheses: there are
    • directional/one-tailed alternative hypotheses:
      • postulate an effect, a difference, a correlation,
      • and its direction (above, you)
    • non-directional/two-tailed alternative hypotheses:
      • postulate an effect, a difference, a correlation,
      • but not its direction (above: the independent observer)
  • prior knowledge is rewarded: the former is easier to accept (above: p=0.125 vs. 0.25)
  • but where are those p-values coming from?

Phase 4, evaluation and interpretation

Choosing a method/test, part 1

  • What kind of study is being conducted
    • descriptive, exploratory, hypothesis-generating
    • hypothesis-testing
  • how many and what kinds of variables are involved?
    • 1 response (goodness-of-fit tests)
    • 1 response & 1 predictor (monofactorial test for independence or differences)
    • 1 response & 2+ predictors (multifactorial analyses)
    • 2 responses (multivariate analyses)
  • are data points related such that you can associate them with each other in a meaningful principled way?
    • no: tests for independent samples
    • yes: tests for dependent samples
    • the latter are usually more powerful

Choosing a method/test, part 2

  • What is the statistic of the dependent variable in the statistical hypothesis?
    • counts/frequencies → often chi-squared tests
    • distributions → often Kolmogorov-Smirnov tests
    • averages/means → often t-tests
    • dispersions → often F-tests
    • correlations → often r or ρ or τ
  • What does the distribution of the data look like?
    • normal: often leads to parametric tests
    • non-normal: often leads to non-parametric, simulation, or exact tests
  • how big are the samples to be collected?
    • <30: often a risk to normality assumptions
    • ≥30: often supporting normality assumptions

Significance testing (again)

  • Your results section should usually include
    • the observed effect e
    • some significance results from some test(s)
    • how both these aspects of your results relate to your hypotheses
  • but again: the p-value indicates how likely the observed result is given the H0 – nothing else
Recall that the standard p-value required in the humanities and social sciences
is 0.05. [...] What does this statistical significance mean? It means that
there is at least a 95% chance that the null hypothesis is *incorrect*.
  • this is completely wrong:
    • this author: p is p(H0 = FALSE | data)
    • actually: p is p(data | H0 = TRUE)
  • often, people distinguish ‘levels of significance’:
    • p<0.001 (highly significant) vs. 0.01>p≥0.001 (very significant) vs. 0.05>p≥0.01 (significant)
    • 0.1>p≥0.05: marginally significant – stupid, don’t use this

Effect sizes

  • As mentioned above, your results should also include effect sizes
  • effect sizes are correlated with p-values, but not deterministically so: often
    • strong effects will be significant, and
    • weak effects will be insignificant
  • but,
    • given large sample sizes, even very weak effects can be significant
    • given large variability, even strong effects can be insignificant
Learner  of-gen  s-gen  Sum
Chinese      20     15   35
German       15     20   35
Sum          35     35   70

 p-value  odds ratio
  0.2320      1.7778

Learner  of-gen  s-gen  Sum
Chinese     200    150  350
German      150    200  350
Sum         350    350  700

 p-value  odds ratio
  0.0002      1.7778
  • you must keep significance and effect size separate in your head
    • significance: how likely is the effect when ‘in reality, there’s nothing’?
    • effect size: how big/strong is the effect regardless of whether it’s random?
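The two tables above differ only in sample size; a sketch reproducing the p-values and the (identical) odds ratios, assuming a chi-squared test without continuity correction (which matches the printed p-values):

```r
small <- matrix(c(20, 15, 15, 20), ncol=2, byrow=TRUE,
   dimnames=list(Learner=c("Chinese", "German"), GENITIVE=c("of-gen", "s-gen")))
large <- small * 10
# sample odds ratio (a/b)/(c/d): the effect size is identical in both tables
odds_ratio <- function(m) (m[1,1]/m[1,2]) / (m[2,1]/m[2,2])
round(c(odds_ratio(small), odds_ratio(large)), 4)  # 1.7778 1.7778
# but the p-values differ by three orders of magnitude
c(chisq.test(small, correct=FALSE)$p.value,   # ≈ 0.2320
  chisq.test(large, correct=FALSE)$p.value)   # ≈ 0.0002
```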

Approaching significance with simulation

For 20 nouns, you measured …

  • a predictor IMAGEABILITY: whether one can imagine/visualize the referent of the noun (n: ‘no’ vs. y: ‘yes’)
  • a response RT: a reaction time score ranging from 1 (fastest) to 20 (slowest)
  • check out this nearly perfect correlation:
Figure 4: The correlation between RT and IMAGEABILITY

How do we determine whether that effect e is significant?

  • The observed effect e (mean RT for n minus mean RT for y) is 14 minus 7 = 7, but H0 hypothesizes an effect of 0
  • how about we generate relevant H0 data and check how the observed effect e compares to those H0 data?
  • relevant H0 data
    • have the same IMAGEABILITY frequencies of n and y (10 each), &
    • have the same values of RT, but
    • are somehow random and, thus, compatible with H0 – how?
  • simple: we destroy the association of RT ~ IMAGEABILITY (n & slow / y & fast) by randomly reordering the values of the predictor IMAGEABILITY!
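The whole recipe can be run as code; here is a sketch with made-up RT values (the slide's actual 20 measurements are not reproduced here, so the resulting p-value differs from the one reported below):

```r
set.seed(1)
# hypothetical data: 10 imageable (fast) and 10 non-imageable (slow) nouns
d <- data.frame(RT           = 1:20,
                IMAGEABILITY = rep(c("y", "n"), each=10))
# observed effect e: mean RT of the 'n' nouns minus mean RT of the 'y' nouns
obs_e <- mean(d$RT[d$IMAGEABILITY=="n"]) - mean(d$RT[d$IMAGEABILITY=="y"])
# H0 effects: shuffle the predictor, destroying the RT ~ IMAGEABILITY association
perm_e <- replicate(10000, {
  shuffled <- sample(d$IMAGEABILITY)
  mean(d$RT[shuffled=="n"]) - mean(d$RT[shuffled=="y"])
})
# p: the proportion of H0 effects at least as large as the observed one
mean(perm_e >= obs_e)
```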

RT ~ IMAGEABILITY (randomized 1)

set.seed(1); d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 5: A correlation between RT and randomized IMAGEABILITY 1

RT ~ IMAGEABILITY (randomized 2)

d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 6: A correlation between RT and randomized IMAGEABILITY 2

RT ~ IMAGEABILITY (randomized 3)

d_rand <- data.frame(RT=d$RT, IMAGEABILITY=sample(d$IMAGEABILITY))
Figure 7: A correlation between RT and randomized IMAGEABILITY 3

We need this much more often …

  • Let’s generate not 3, but 100,000 random H0 distributions, i.e. 100,000 effects e, …
Figure 8: The first 10 of 100,000 random distributions
  • … which should average out to 0

But how do we evaluate this?

Figure 9: A histogram of the H0 correlations between RT and IMAGEABILITY
  • We can represent all the H0 effects e1-100,000, e.g. in a histogram
  • we can add a vertical line to the histogram that represents the actual observed effect e of 7
  • we can count how often we get a value of 7 or higher in the H0 data and …
  • … express that as a percentage – that is p
  • here, p is 0.0034 – the observed difference of 7 between non-imageable and imageable nouns is significant(ly different from 0)

How well does this work?

  • Reminder: the p-value from the simulation is 0.0034
  • the ‘gold standard’ p-value from an exact (!) t-test for independent samples is 0.00342, …
  • … which means that the simulation approach scores a nearly perfect result
  • what about the parametric t-test (according to Welch)? Its p-value is 0.00232, which is also pretty close (but worse than the simulation!)
  • what about the parametric t-test (according to Student)? Its p-value is 0.00227, which is also close (but worse than the simulation!)
  • simulation-based approaches are very versatile and useful – they can often help when few other things can!