1 The notion of correlation (again)

In some sense, this whole course/quarter will be about the notion of correlation, which is why I want to spend some time reminding us all of what that is/means.

Definition: Two variables A and B are correlated

  • if knowing the value/range of A makes it possible to ‘predict’ the value/range of B better than if one doesn’t know the value/range of A; and
  • if knowing the value/range of B makes it possible to ‘predict’ the value/range of A better than if one doesn’t know the value/range of B.

Here is an example where knowing A (or B) does not help with ‘predicting’ B (or A):
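
The code that generated A and B is not shown in the output below, but a minimal sketch of one data-generating process that produces results of this shape might look as follows (the seed and the uniform distributions are my assumptions):

set.seed(42)       # some seed; the original one is not shown
A <- runif(100)    # 100 uniform random values
B <- runif(100)    # another 100 uniform random values, independent of A
cor.test(A, B)     # Pearson's r with a significance test
summary(lm(B ~ A)) # linear regression of B on A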

##
##  Pearson's product-moment correlation
##
## data:  A and B
## t = 0.16863, df = 98, p-value = 0.8664
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1799881  0.2127386
## sample estimates:
##        cor
## 0.01703215
##
## Call:
## lm(formula = B ~ A)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.50405 -0.23895  0.00904  0.21547  0.46846
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.50851    0.05974   8.512 2.03e-13 ***
## A            0.01730    0.10260   0.169    0.866
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2732 on 98 degrees of freedom
## Multiple R-squared:  0.0002901,  Adjusted R-squared:  -0.009911
## F-statistic: 0.02844 on 1 and 98 DF,  p-value: 0.8664

Why does knowing A (or B) not help ‘predicting’ B (or A)? Because, for instance, no matter which value range of A you pick, you still can’t predict B very well (and vice versa). We can exemplify that easily with a regression line:
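
A minimal sketch of how one might draw such a plot in R (not necessarily the original plotting code):

plot(A, B, xlab="A", ylab="B", pch=16) # scatterplot of B against A
abline(lm(B ~ A)) # add the fitted regression line, which is essentially flat here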

By contrast, here is an example where knowing A (or B) does help with ‘predicting’ B (or A):
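
As above, the data-generating code is not shown; here is one sketch that yields output of this shape (the exact noise term is my assumption):

set.seed(42)           # some seed; the original one is not shown
A <- runif(100)        # 100 uniform random values
B <- A + runif(100)/5  # B is A plus a little uniform noise (assumed process)
cor.test(A, B)         # Pearson's r with a significance test
summary(lm(B ~ A))     # linear regression of B on A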

##
##  Pearson's product-moment correlation
##
## data:  A and B
## t = 48.903, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9705438 0.9866034
## sample estimates:
##       cor
## 0.9801194
##
## Call:
## lm(formula = B ~ A)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.100809 -0.047789  0.001807  0.043093  0.093692
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.10170    0.01195   8.512 2.03e-13 ***
## A            1.00346    0.02052  48.903  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05463 on 98 degrees of freedom
## Multiple R-squared:  0.9606, Adjusted R-squared:  0.9602
## F-statistic:  2391 on 1 and 98 DF,  p-value: < 2.2e-16

Why does knowing A (or B) help ‘predicting’ B (or A)? Because, for instance, knowing the value range of A lets you predict B much better (and vice versa). Again, we can exemplify that easily with a regression line:
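
Again a minimal plotting sketch (not necessarily the original code):

plot(A, B, xlab="A", ylab="B", pch=16) # scatterplot of B against A
abline(lm(B ~ A)) # the points cluster tightly around the fitted line (r ≈ 0.98)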

Now, here’s an example of a correlation with the same slope, but more noise: