Predictive modeling in linguistics

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

15 Jul 2023 12-34-56

1 Fundamentals of regression modeling, part 1

In a sense, this whole course is about the notion of correlation, which is why I want to spend some time reminding us all of what that means.

Definition: Two variables A and B are correlated

  • if knowing the value/range of A makes it easier to ‘predict’ the value/range of B than it would be without knowing the value/range of A; and
  • if knowing the value/range of B makes it easier to ‘predict’ the value/range of A than it would be without knowing the value/range of B.
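To make the definition concrete, here is a minimal Python sketch (not code from this course) that compares two ways of ‘predicting’ one variable: by its overall mean, and by its mean within the value range of the other variable. The data, the seed, the number of ranges, and the helper names `rmse` and `errors_without_and_with` are all assumptions made for illustration.

```python
import random
from statistics import fmean

def rmse(values, predictions):
    """Root mean squared error between observed values and predictions."""
    return fmean([(v - p) ** 2 for v, p in zip(values, predictions)]) ** 0.5

def errors_without_and_with(known, target, n_ranges=4):
    """Return (error ignoring `known`, error using `known`) for predicting `target`.

    Ignoring `known`: predict every target value by the overall mean.
    Using `known`: predict each target value by the mean of the target
    within the equal-width range of `known` that the data point falls into.
    """
    lo, hi = min(known), max(known)
    width = (hi - lo) / n_ranges
    # Index of the range each data point falls into (the maximum goes in the last range).
    ranges = [min(int((k - lo) / width), n_ranges - 1) for k in known]
    range_means = {
        r: fmean([t for t, rr in zip(target, ranges) if rr == r])
        for r in set(ranges)
    }
    without = rmse(target, [fmean(target)] * len(target))
    with_known = rmse(target, [range_means[r] for r in ranges])
    return without, with_known

rng = random.Random(1)  # arbitrary seed, only for reproducibility
A = [rng.uniform(0, 10) for _ in range(100)]
B = [2 * a + rng.gauss(0, 1) for a in A]  # hypothetical data: B depends on A by construction

print(errors_without_and_with(A, B))  # predicting B without vs. with knowledge of A
print(errors_without_and_with(B, A))  # predicting A without vs. with knowledge of B
```

In both directions the second number should come out clearly smaller than the first, which is what the two bullet points of the definition describe.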

Here is an example where knowing A (or B) does not help ‘predict’ B (or A):


    Pearson's product-moment correlation

data:  A and B
t = 0.16863, df = 98, p-value = 0.8664
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1799881  0.2127386
sample estimates:
       cor 
0.01703215 
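As a cross-check on this output, the t statistic can be recomputed from the correlation and the degrees of freedom alone, since for a Pearson correlation t = r · sqrt(df / (1 − r²)) with df = n − 2. The following Python sketch assumes that this is the test behind the output; the function name `t_from_r` is mine, and the p-value is not recomputed here.

```python
import math

def t_from_r(r, df):
    """t statistic for H0 'true correlation is 0', given sample r and df = n - 2."""
    return r * math.sqrt(df / (1 - r ** 2))

r, df = 0.01703215, 98  # values copied from the output above
t = t_from_r(r, df)
print(round(t, 5))  # should agree with the reported t up to rounding
```

With a correlation this close to 0, the t statistic is also close to 0, which is why the p-value is so large.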

Why does knowing A (or B) not help ‘predict’ B (or A)? Because, for instance, no matter which value range of A you pick, you can’t predict B very well (and vice versa):

We can easily illustrate that with a regression line as well:
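A regression line for two unrelated variables can be sketched in a few lines of Python. This is an illustration on simulated data, not the data behind the output above; the seed, the sample size, and the helper name `least_squares_line` are assumptions.

```python
import random
from statistics import fmean

def least_squares_line(x, y):
    """Slope and intercept of the least-squares line predicting y from x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(42)  # arbitrary seed, only for reproducibility
A = [rng.gauss(0, 1) for _ in range(100)]
B = [rng.gauss(0, 1) for _ in range(100)]  # generated independently of A

slope, intercept = least_squares_line(A, B)
# Predicted B at the smallest and at the largest observed value of A.
pred_at_min = intercept + slope * min(A)
pred_at_max = intercept + slope * max(A)
print(slope, pred_at_min, pred_at_max)
```

Because B was generated independently of A, the slope should be close to 0, so the line predicts nearly the same B (roughly the mean of B) across the whole range of A.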

By contrast, here is an example where knowing A (or B) does help ‘predict’ B (or A):


    Pearson's product-moment correlation

data:  A and B
t = 48.903, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9705438 0.9866034
sample estimates:
      cor 
0.9801194 
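The 95 percent confidence interval in this output can be approximated from r and n alone. The Python sketch below assumes the interval is based on Fisher's z transformation, which is what the documentation of R's `cor.test` describes for Pearson correlations; the function name `fisher_ci` and the hard-coded critical value are mine.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                         # transform r to the z scale
    half_width = z_crit / math.sqrt(n - 3)    # standard error on the z scale is 1/sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

r, n = 0.9801194, 100  # r copied from the output above; n = df + 2
lower, upper = fisher_ci(r, n)
print(round(lower, 4), round(upper, 4))
```

The interval is narrow and far from 0, in line with the very small p-value.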

Why does knowing A (or B) help ‘predict’ B (or A)? Because, for instance, knowing the value range of A lets you predict B better (and vice versa):