Description vs prediction

Think of the lines we drew in class as tools which do two things:

  1. Describe the data we have, and
  2. Predict new data we might collect

In statistical language, the gaps between our line and the data points are called residuals.

We can distinguish:

  1. Residuals for the data we have now (how well does the line describe the data?).
  2. Residuals for new data we collect after drawing the line (how well does the line predict?).

Another common way to refer to residuals is as the error in a model. That is, the line describes a model or idealised relationship between variables.

The residuals give an estimate of how much error there will be in our predictions when we use the model.

You should have found that the total length of the residuals for the curved lines is smaller than the residuals for the straight line. If this isn’t the case, check your measurements.

This is because a curved line will be a better description of the data you had when fitting it, but will be a poorer predictor of new data.

This is why fitting straight lines is such a common technique. We know that the line doesn’t describe the data we have that well, but we hope it will be a better predictor of future events than anything else.

Worse is better

The reason curved lines were worse for new data is that there is a tradeoff going on:

  • If we draw a curved line to get close to the original data points, then our lines reflect peculiarities in the sample. That is, our lines are drawn to accommodate random variation in this specific sample.

  • Because these random variations aren’t repeated in new samples, the lines fit less well when we swap datasets.

In fact, because the straight line (mostly) ignores this sample variation, it can be a better estimate of the real relationship in the population as a whole¹.

So worse is sometimes better: Because they were simpler, straight lines were a worse fit for our original dataset. But they were a better predictor in new random samples.

This is an example of overfitting. By overfitting, we mean that a model (in this case the line) is too closely matched to a particular sample, and so might not be a good predictor of the population as a whole.

Overfitting is the reason we prefer simpler models (lines) to more complicated ones.
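
To see the tradeoff in action, here is a minimal simulation sketch (not the class exercise itself, and all the names below are made up): we draw two random samples from the same ‘population’, fit a straight line and a very wiggly polynomial to the first sample, and then check how well each predicts the second.

library(tidyverse)

set.seed(1)
# the same 'population' relationship, sampled twice with random noise
make_sample <- function(n = 20) {
  tibble(x = runif(n, 0, 10),
         y = 2 + 0.5 * x + rnorm(n, sd = 2))
}
train <- make_sample()
test  <- make_sample()

straight <- lm(y ~ x, data = train)
wiggly   <- lm(y ~ poly(x, 8), data = train)

# total length of the residuals on the sample used for fitting...
sum(abs(residuals(straight)))
sum(abs(residuals(wiggly)))    # smaller: the curve describes the first sample better

# ...and on the new sample, where the wiggly line usually does worse
sum(abs(test$y - predict(straight, newdata = test)))
sum(abs(test$y - predict(wiggly, newdata = test)))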

Squared residuals

In our hand-fitting example we measured the absolute difference (in mm) from the triangular points to our lines.

However, it’s easier to program computers to find the line with the smallest squared residuals. For this and various other reasons, most analyses fit lines by minimising the squared residuals between the data points and the line.

In graphic terms, this means the computer is finding the line where the area of the red box is minimised (on the right), rather than the length of the red line (on the left):

You don’t need to worry about this now, but think back to this graphic if you see terms like “sums of squares” or “squared residuals” in future: it relates to the idea that regression (and ANOVA) try to fit a line which makes the total area of the red boxes as small as possible across all the data points in the sample.
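
As a concrete sketch of what this means (using the built-in mtcars data rather than the hand-fitting exercise, and with handline as a made-up hand-drawn line), lm chooses the intercept and slope that make the total of the squared residuals as small as possible:

# a hypothetical hand-drawn line: intercept 37, slope -5
handline <- function(wt) 37 - 5 * wt
sum((mtcars$mpg - handline(mtcars$wt))^2)   # total squared residual for the hand-drawn line

# lm() chooses the intercept and slope that minimise this total
leastsq <- lm(mpg ~ wt, data = mtcars)
sum(residuals(leastsq)^2)                   # never larger than the hand-drawn total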

The shaded area when using geom_smooth

If you use geom_smooth with method=lm, you get a grey shaded area around the line.

library(tidyverse)   # for ggplot2 and the pipe
mtcars %>%
  ggplot(aes(wt, mpg)) +
    geom_point() +
    geom_smooth(method=lm)

The shaded area shows the standard error of the line of best fit. This is an estimate of how confident we are about the predictions the line makes. This video explains it quite well: https://www.youtube.com/watch?v=1oHe1a3JqHw. When he uses the Greek letter, he just means “slope”.
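
If you want the numbers behind the band, here is a rough sketch: fit the same model yourself and ask predict for the standard error of the fitted value at each point. The ribbon is approximately the fitted line plus or minus two of these standard errors.

fit <- lm(mpg ~ wt, data = mtcars)
# standard error of the fitted line at each data point
head(predict(fit, se.fit = TRUE)$se.fit)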

If you want to hide it, you can add se=FALSE:

mtcars %>%
  ggplot(aes(wt, mpg)) +
    geom_point() +
    geom_smooth(method=lm, se=FALSE)

Why don’t we always use real data?

Real data is often quite complicated, and it is sometimes easier to simulate data which illustrates a particular teaching point as clearly as possible. It also lets us create multiple examples quickly.

It is important to use real data though, and this course includes a mix of both simulated and real data.

Reducing variance

When we introduced multiple regression we said there were (at least) 3 benefits:

  1. Reducing variance/improving prediction
  2. Testing moderation or interaction and
  3. ‘Controlling for’ bias in observational data.

In the workshop we focussed on testing moderation, but this section demonstrates how multiple regression can be used to reduce variance.


Data from a clinical trial of Functional Imagery Training (FIT), run by Jackie Andrade and others at the University of Plymouth, are available at https://zenodo.org/record/1120364/files/blind_data.csv. The study aimed to help obese people to lose weight.

In this file, group represents the treatment group (FIT=2; Motivational Interviewing=1). The kg1 and kg3 variables represent the patients’ weights in kilograms before (kg1) and after (kg3) treatment.

We can load the data like this:

library(tidyverse)
fitdata <- read_csv('https://zenodo.org/record/1120364/files/blind_data.csv')
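
It’s worth a quick look at the data before going on; glimpse lists each column with its type and the first few values:

fitdata %>% glimpse()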

A simple t-test shows that, at follow-up, the difference between FIT and MI groups was not statistically significant.

t.test(kg3 ~ group, data=fitdata)

    Welch Two Sample t-test

data:  kg3 by group
t = 1.4944, df = 109.48, p-value = 0.138
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.442775 10.288570
sample estimates:
mean in group 1 mean in group 2 
       88.45849        84.03559 

Note that the difference between the group means was 88.46 - 84.04 = 4.42 kg, and the t statistic was 1.49.
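
If you want to check those group means directly, a quick sketch with dplyr (assuming no further filtering of the data is needed):

fitdata %>%
  group_by(group) %>%
  summarise(mean_kg3 = mean(kg3, na.rm = TRUE))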

We can do exactly the same test using regression:

lm(kg3 ~ group, data=fitdata)

Call:
lm(formula = kg3 ~ group, data = fitdata)

Coefficients:
(Intercept)        group  
     92.881       -4.423  

Explanation of the command and output: We used lm with the formula "kg3 ~ group". This told R to fit a model with one predictor, group (coded 1 = MI, 2 = FIT). In the output the group coefficient is -4.42: the same difference we saw with the t.test, with the sign reversed because the coefficient gives the change in kg3 as group increases from 1 to 2.

We can also calculate the t and p statistics for this difference using lm together with the tidy function:

library(broom)
lm(kg3 ~ group, data=fitdata) %>% tidy
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    92.9       4.76     19.5  1.77e-37
2 group          -4.42      2.97     -1.49 1.39e- 1

Explanation of the code: We first loaded the broom package because it contains a function, tidy, that we want to use. On the next line we ran the same model as before, but piped it to the tidy function. tidy shows us the model coefficients again, but makes them available as a small dataframe and displays them alongside the statistic (the t value) and p value. Note that the t value in the statistic column matches the t.test result: -1.49 (again with the sign reversed).

A more powerful test

So far so good: but we heard above that including extra predictors in a multiple regression can add statistical power.

In this case, the strongest predictor of weight at follow-up was weight on entry to the trial, so let’s include that in a new regression model:

lm(kg3 ~ group + kg1, data=fitdata) %>% tidy
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    7.91     3.00        2.64 9.55e- 3
2 group         -5.96     0.912      -6.54 2.09e- 9
3 kg1            0.963    0.0296     32.5  6.59e-58

Explanation of the command: We ran the same model, but this time added "+ kg1" to the formula. This added kg1 as a second predictor, after group. In the output we can see the estimate for group and the statistic (t value) have changed: the adjusted difference between the two groups is now -5.96 kg and the t statistic is -6.54. This difference is also now statistically significant, p < .001.
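
If you also want a confidence interval for this adjusted group difference, tidy can add one (a sketch; read the conf.low and conf.high columns for the group row):

lm(kg3 ~ group + kg1, data=fitdata) %>% tidy(conf.int = TRUE)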


Why does this work? The biggest source of unexplained variance in between-groups studies is almost always the variation between individuals. People just vary quite a lot!

By including each participant’s baseline weight (their weight before treatment), each participant effectively becomes their own control. This means there is much less variation to explain: instead of having to explain all the variation between individuals, we are now explaining only the variation in changes within individuals over the course of the trial. And the group participants were randomised to does a good job of explaining that, in this study, because FIT turned out to be very effective.
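
One way to see this reduction directly (a sketch, not part of the published trial analysis) is to compare the residual standard error of the two models, which broom’s glance function reports in the sigma column:

library(broom)
lm(kg3 ~ group, data=fitdata) %>% glance() %>% select(r.squared, sigma)
lm(kg3 ~ group + kg1, data=fitdata) %>% glance() %>% select(r.squared, sigma)

The second model should show a much smaller sigma: kg1 soaks up the variation between individuals, leaving the group effect much easier to detect.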


This was perhaps the simplest possible example of using an extra predictor to reduce unexplained variance. However there are many other examples which don’t involve using a baseline measure from the same participant. This paper gives many more examples of suppression effects, although it is technical in places. Tzelgov and Henik (1991) give many examples from Psychology.

Confounding is complicated

Getting rid of all confounding is hard, and there are no simple statistical solutions - thinking is required! For a very interesting and fairly accessible introduction to this material see Pearl and Mackenzie (2018). This isn’t required reading for this course, but would be very helpful if you plan to do a non-experimental project using quantitative methods in stage 4.

Choosing between predictors using multiple regression is normally a bad idea

Choosing between possible predictors with multiple regression (e.g. by looking at the p values of the coefficients) is normally a very bad idea. When people say that we can use multiple regression to choose between different sets of predictors, they are often mistaken, and the situation is much more complicated than is normally acknowledged.

In particular, automated methods to discover which are the ‘best’ or most important predictors have long been known to be a bad idea. One influential paper concluded we should: “treat all claims based on stepwise algorithms as if they were made by Saddam Hussein on a bad day with a headache having a friendly chat with George Bush” (Derksen and Keselman 1992). This is probably good advice.

The main problem is that standard regression methods assume that predictors are measured perfectly, without any errors. However this isn’t the case: most measurements do include errors, and with psychological constructs these can sometimes be quite large. When errors are present, and especially when the predictors are themselves correlated, then simple sampling variation can produce quite large changes in the estimates of ‘which predictor is “best”’.
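
A small simulation sketch of this point (every name here is made up): x1 and x2 are two noisy, correlated measures of the same underlying construct, and the predictor that looks ‘best’ changes from sample to sample.

library(tidyverse)
library(broom)

set.seed(2)
best_predictor <- function(n = 100) {
  construct <- rnorm(n)          # the thing we actually care about
  x1 <- construct + rnorm(n)     # two noisy, correlated measures of it
  x2 <- construct + rnorm(n)
  y  <- construct + rnorm(n)
  lm(y ~ x1 + x2) %>%
    tidy() %>%
    filter(term != "(Intercept)") %>%
    slice_max(abs(statistic)) %>%   # keep whichever predictor has the larger t value
    pull(term)
}

# which predictor 'wins' varies across repeated random samples
table(replicate(20, best_predictor()))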

Benefits of multiple regression

Reducing variance/Improving prediction

If we include extra variables we may get a better prediction than if we used a simple correlation, although this isn’t guaranteed. An example of this is in clinical trials where patients have been randomised to different interventions but nonetheless vary in characteristics which predict their outcome. These are called prognostic variables, and in clinical trials for depression would include the number of previous episodes of depression, or the number of other diagnoses a patient has (e.g. depression plus a personality disorder, or depression plus anxiety).

By including measures of prognostic variables, researchers may increase the statistical power of their study; that is, the ability to detect a difference between the interventions.

It’s important to note that this benefit still applies when patients have been randomised to interventions, and even if the study groups are balanced with respect to the prognostic variables. That is, both interventions might have the same number of patients with many previous episodes of depression, and the same number of comorbidities.

In this case the benefit of multiple regression comes because we are explaining unexplained variability in patient outcomes with the prognostic variables. By reducing the unexplained variability in the outcome, the difference between intervention groups can become clearer. You can think of controlling for prognostic variables as eliminating noise or static from a TV picture: by doing this it’s easier to see what’s going on.

Testing moderation/interactions

In our causal diagrams we found cases where the relationship between two variables might be changed by a third variable. For example, we might hypothesise that the relationship between expertise and earnings is different for men and women.

Multiple regression lets us test if that is actually the case. By adding multiple predictors, and running a model which allows the outcome to change when these predictors combine or interact, we can judge whether the relationship between predictor and outcome really is altered by the third variable.
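
As a sketch of what such a model looks like in R (using the built-in mtcars data, with am, the transmission type, standing in as a hypothetical ‘third variable’ that might alter the relationship between wt and mpg):

library(tidyverse)
library(broom)

# the * operator asks for both predictors plus their interaction (wt, am and wt:am)
lm(mpg ~ wt * am, data = mtcars) %>% tidy()

The wt:am row is the interaction term: its estimate tells us how much the slope for wt differs between the two transmission types.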

This is the focus of the worksheet below.

‘Controlling for’ bias in observational data

As I hinted in the workshop, sophisticated analysis of causal diagrams lets us identify cases where confounding might be taking place, as well as other problems such as collider biases.

Used wisely, in conjunction with causal diagrams, multiple regression can help estimate the causal relations between variables using observational data.

A good example is the establishment of a link between smoking and cancer (see Pearl and Mackenzie 2018, chap. 4 and especially chapter 5). It may be hard to imagine now, but the tobacco industry claimed for decades that associations between smoking and lung cancer were confounded: caused by other, unobserved variables. Careful work by epidemiologists over many years measured and included many possible confounding variables in analyses of lung cancer deaths, and found that this was not the case. By including possible confounders this observational research eventually persuaded policy makers that smoking really does cause lung cancer, and in the process saved many millions of lives.

However, making causal estimates from non-experimental data is fraught with potential problems. You can read more about this here.

You will also sometimes see people claim that multiple regression provides a way of choosing between different possible predictors of an outcome: that is, deciding which predictor is the best or most important. This is basically untrue; see here for why.

Derksen, Shelley, and Harvey J Keselman. 1992. “Backward, Forward and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables.” British Journal of Mathematical and Statistical Psychology 45 (2): 265–82.

Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.

Tzelgov, Joseph, and Avishai Henik. 1991. “Suppression Situations in Psychological Research: Definitions, Implications, and Applications.” Psychological Bulletin 109 (3): 524.


  1. It’s not always true, but it’s a good rule of thumb.