In brief

Multiple regression is a technique which can describe the relationship between one outcome and two or more predictors.

We can also use multiple regression to describe cases where two variables interact. That is, when the effect of one predictor is increased or decreased by another (this is called moderation, as we saw in the first session).

Multiple regression is important because it allows us to make more realistic models and better predictions. But — like any sharp tool — it should be used carefully. If our statistical model doesn’t match the underlying network of causes and effects, or if we have used a biased sample, we can be misled.

When evaluating our models we can ask two useful questions. First, how much of the variability in the outcome do my predictors explain? The \(R^2\) statistic answers this. Second, does the model make better predictions than just taking the average outcome? For this we can compute a Bayes Factor.

Before you start

  • Make a new Rmd file to save your work in this session. Call it regression-multiple.rmd. Make sure you save it in the same directory (folder) as your data handling and regression rmd files (something like rmip_2021).

Causal diagrams

Think back to our first session on causes and effects. When we drew causal diagrams of our research questions we found several different kinds of case.

We drew diagrams like this for those cases:



Or to give a concrete example:



When we say that effect modification (i.e. moderation) is taking place, this is really what we mean:



Benefits of multiple regression

When we have a causal diagram involving multiple variables there are three key benefits to using multiple regression:

  1. Reducing variance/Improving prediction: adding variables can increase precision and the statistical power of our predictions.

  2. Testing moderation or interactions: when the relationship between two variables might be changed by a third variable, we can test for this using multiple regression (or Anova; as you’ll see later, Anova is just a special kind of linear model).

  3. ‘Controlling for’ bias in observational data: Used wisely, in conjunction with causal diagrams, multiple regression can help estimate the causal relations between variables using observational data (although we need to be very careful as we do this).

You can read more about each of these benefits here. However, our focus in using multiple regression in this module is to test for moderation.

Testing moderation

Moderation is when two predictor variables interact to cause an outcome.

If you have a moderation hypothesis — for example, that the relationship between a predictor and an outcome might differ between groups — the first thing you should do is plot the data.

As an example, I’ll use the studyhabits dataset from psydata, which we used previously. We already know there is a link (in these data) between study hours and grades because we can see it in the plot (and we modelled it using lm in the first regression session):

library(psydata)
library(tidyverse)

studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_jitter() + 
  geom_smooth(method=lm, se=F)

Interaction plots

We could also ask: “is this relationship the same for men and women?” To visualise the differences we can use colour:

studyhabits %>%
  ggplot(aes(work_hours, grade, color=female)) +
  geom_jitter() +
  geom_smooth(method="lm", se=F)

Explanation of the code and output: I added color=female to the ggplot function. This splits the data by gender. We can see that the slopes for men and women do look different.

This type of plot shows the difference in slopes between two (or more) categories and is called an ‘interaction plot’.

  • Use the code above to reproduce the coloured scatter-plot.
  • Work with a partner. Agree together:
    • What is the overall pattern of these simulated results?
    • Does extra time spent working benefit men and women equally?
  • Adapt the code to produce another plot with a different variable on the x axis. Are the slopes for men and women the same or different this time?
# use `work_consistently` as the x axis
studyhabits %>%
  ggplot(aes(work_consistently, grade, color=female)) +
  geom_jitter() +
  geom_smooth(method="lm", se=F)

Using lm with two predictors

To add more predictors to our model we can simply add them with a “+” symbol. For example:

lm(grade ~ work_hours + female, data = studyhabits)

Call:
lm(formula = grade ~ work_hours + female, data = studyhabits)

Coefficients:
(Intercept)   work_hours   femaleTRUE  
    38.8234       0.8719      -1.5761  

This model now has three coefficients:

  • (Intercept)
  • work_hours
  • femaleTRUE

These three coefficients describe two parallel sloped lines (see the sketch below this list):

  • The steepness of both lines (they share a single slope): work_hours
  • The place on the y axis where the line for men (female = FALSE) starts: (Intercept)
  • The difference in the position of the two lines, which is constant across the x axis: femaleTRUE
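
To make this concrete, here is a quick sketch which refits the same model and extracts its coefficients (the object name second.model is just for illustration):

second.model <- lm(grade ~ work_hours + female, data = studyhabits)
coef(second.model)

# the line for women starts from the intercept plus the femaleTRUE coefficient,
# i.e. roughly 38.82 + -1.58
coef(second.model)["(Intercept)"] + coef(second.model)["femaleTRUE"]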

The plot this model represents looks like this:

Explanation: Adding female to the model allows the intercepts to vary between men and women, but the slopes are still fixed to the same value. The lines on the plot look a slightly better fit, but not much.

We need to allow for an interaction if we want our model to match the data.

Adding an interaction

To add an interaction we need to replace the “+” symbol with a “*” in our formula:

lm(grade ~ work_hours * female, data = studyhabits)

Call:
lm(formula = grade ~ work_hours * female, data = studyhabits)

Coefficients:
          (Intercept)             work_hours             femaleTRUE  work_hours:femaleTRUE  
               18.623                  1.795                 22.463                 -1.074  

Explanation: This time we have changed the formula:

  • female is still included as a second predictor, but
  • we used the * symbol (rather than +) between work_hours and female
  • this allows the slope for work_hours to be different for men and women.

Description of the output: This time, we have 4 coefficients shown:

  • (Intercept)
  • work_hours
  • femaleTRUE
  • work_hours:femaleTRUE

Interpreting the coefficients: These coefficients have changed their meaning from the first model we ran. However we can still think of them as either points or slopes on the graph with fitted lines overlaid:

  • (Intercept) is the point for men, where work_hours = 0. When female = FALSE this is a man, so we are looking at the red line, where it crosses zero on the x axis. Find this point on the plot now.
  • femaleTRUE is the difference between men and women, when work_hours = 0 (the difference between the blue and red lines, at the zero point on the x axis)
  • work_hours is the slope (relationship) between work_hours and grade for men (the steepness of the red line)
  • work_hours:femaleTRUE is the difference in slopes for work hours, for women. So this is the slope for women minus the slope for men: that is, the difference in steepness between the red and blue lines. It’s NOT the slope of the blue line. We need to add both slope coefficients to see that.

Compare the model output and plot which are repeated below:


Call:
lm(formula = grade ~ work_hours * female, data = studyhabits)

Coefficients:
          (Intercept)             work_hours             femaleTRUE  work_hours:femaleTRUE  
               18.623                  1.795                 22.463                 -1.074  

In pairs:

  1. For each of the 4 coefficients, agree if it represents a point or a slope
  2. Find each of the points on the plot (i.e. which coefficient is it)
  3. Compare the slope coefficients to the lines on the plot - can you explain which coefficient describes which slope?
  4. What would happen if the sign of each coefficient was reversed (e.g. if a positive coefficient became negative)? What would this mean for the plot?

Double check you understand how to interpret the work_hours:femaleTRUE coefficient. It’s very common for regression coefficients to represent differences in this way. But in this example it does mean we have to know both work_hours (the slope for men) and work_hours:femaleTRUE (the difference in slopes for men and women) to be able to work out the slope for women.

To test your knowledge: What is the slope for women in this model?

The answer is 0.72.

This is because the data is coded such that the work_hours slope is the coefficient for men. To get the slope for women we have to add:

  • the slope for men (1.79)
  • the difference in slopes between men and women (-1.07)

So the slope for women is: 1.79 + -1.07 = 0.72.
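
We can check this arithmetic in R by storing the model and adding the two slope coefficients together (the object name third.model matches the one used in the next section):

third.model <- lm(grade ~ work_hours * female, data = studyhabits)

# slope for women = slope for men + difference in slopes
# this should come to roughly 1.79 + -1.07 = 0.72
coef(third.model)["work_hours"] + coef(third.model)["work_hours:femaleTRUE"]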

Prediction

This section is an optional extension exercise. It is not required for the PSYC519/719 assessment, but would be useful if you are planning on using these methods in your own research.

(If you haven't already done so, re-run the code from the previous section and make sure you have a variable called third.model stored in your Environment pane in RStudio):

third.model <- lm(grade ~ work_hours * female, data = studyhabits)

Because there are now 4 coefficients in this third model, it’s more fiddly to make predictions by hand.

R’s augment function (in the broom library) will do it for us. By default, augment uses the regression model (the fitted lines) to make a prediction for each row in the original dataset. So, for example, we can say:

library(broom)
augment(third.model) %>% 
  # only show the first 4 columns
  select(1:4) %>% 
  # and the first 6 rows
  head(6) 
# A tibble: 6 x 4
  grade work_hours female .fitted
  <dbl>      <dbl> <lgl>    <dbl>
1    65         26 TRUE      59.8
2    57         26 TRUE      59.8
3    45         16 FALSE     47.3
4    70         33 TRUE      64.9
5    55         25 TRUE      59.1
6    50         20 TRUE      55.5

Explanation of the code:

  • We used augment to make predictions for all the observations in our dataset.
  • We used select(1:4) to show only the first 4 columns.
  • We used head(6) to show only the first 6 rows.

Explanation of the output: R has made a prediction for grade for each person in the dataset, based on the coefficients from the model. Because we used head(6) we only see the first 6 predictions (see what happens without it for yourself).

Predicting for new data

Often, we don’t want predictions for the data in our sample: Instead we want to make predictions at particular values of the predictor variables.

For example, we might want to make predictions for the typical person, or specifically for a woman with a high or low number of work_hours.

To do this we need to:

  1. Create a new dataset where the predictors are set to the values we’re interested in.
  2. Give this new data to the augment function, as well as the fitted model.

We can make a new dataset like this:

typicalpeople <- tibble(female=c(TRUE,TRUE), work_hours=c(10, 20))

Explanation of the code: We use tibble to make a new dataframe. The c() command combines the values given into a vector (think ‘combine’). So, this code creates a new dataframe with 2 rows, where female is always TRUE, and where work_hours is either 10 or 20.

We should check we got what we expected by looking at the new dataframe:

typicalpeople
# A tibble: 2 x 2
  female work_hours
  <lgl>       <dbl>
1 TRUE           10
2 TRUE           20

Then we can pass this new dataframe to augment again, along with our fitted model:

augment(third.model, newdata = typicalpeople)
# A tibble: 2 x 4
  female work_hours .fitted .se.fit
  <lgl>       <dbl>   <dbl>   <dbl>
1 TRUE           10    48.3   1.83 
2 TRUE           20    55.5   0.821

Explanation: We made a new dataframe containing only 2 rows. We passed this to the augment function along with our regression model (third.model). We got two predictions back: one for a woman who works 10 hours, and the other for a woman who works 20 hours. The .fitted column provides the prediction.

Now calculate predictions for:

  • a man who does 35 hours of work each week
  • a woman who does 22.5 hours of work each week
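
One way to approach this (a sketch; the dataframe name new_cases is just for illustration) is to build a new dataframe containing those two cases and pass it to augment:

# remember that in these data a man is coded as female = FALSE
new_cases <- tibble(female = c(FALSE, TRUE), work_hours = c(35, 22.5))
augment(third.model, newdata = new_cases)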

Additional (trickier) extension exercise:

  • Make a plot from a set of predictions
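
If you want to try this, one possible approach (a sketch; the particular range of work_hours values is an arbitrary choice) is to make predictions across a grid of predictor values and then plot the .fitted column:

# predictions for men and women across a range of work_hours values
prediction_grid <- expand_grid(female = c(TRUE, FALSE),
                               work_hours = seq(5, 45, by = 5))

augment(third.model, newdata = prediction_grid) %>%
  ggplot(aes(work_hours, .fitted, color = female)) +
  geom_line()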

How good?

In describing how we fit lines to data, I explained that lm attempts to minimise the squared error of the predictions it makes.

For a single-predictor model this was quite easy to visualise:

R fits the line which minimises the squared prediction errors, shown as the blue squares on the right; the red squares show the errors we would make if we simply predicted the mean of the outcome for everyone.

We can describe how good the model is at making predictions using the relative sizes of these red and blue squares.

  • Red squares: the variance in the outcome variable, \(y\)
  • Blue squares: the residual or unexplained variance, after fitting the model

The \(R^2\) statistic describes the proportion of the red squares eliminated by fitting the model, and this works just the same for models with more than one predictor.
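
To connect this picture to the statistic itself, here is a small sketch which computes \(R^2\) directly from the ‘squares’, using the third.model we fitted above (and assuming no missing values in these columns):

total_ss    <- sum((studyhabits$grade - mean(studyhabits$grade))^2)  # the 'red squares'
residual_ss <- sum(residuals(third.model)^2)                         # the 'blue squares'

# the proportion of the red squares eliminated by the model
# (this should match the r.squared value reported by glance below)
1 - residual_ss / total_ss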

\(R^2\) (the statistic)

Note: the ‘R’ in \(R^2\) has nothing to do with the R software; it’s the name of the statistic, like a ‘t-test’ or a ‘p value’.

\(R^2\) tells us how much variance in the outcome is explained by the model.

Above, we ran a model to predict grades in the studyhabits dataset:

third.model <- lm(grade ~ work_hours * female, data = studyhabits)

To calculate \(R^2\) we can use the glance command in the broom package.

library(broom)

glance(third.model) %>% 
  # only show the first 2 columns
  select(1:2)
# A tibble: 1 x 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.221         0.213

Explanation of the output:

  • First we loaded the broom package.
  • Then, glance produced a dataframe of statistics relating to our model.
  • For the moment we only need the first two columns: r.squared and adj.r.squared.
  • For this model, adj.r.squared = 0.21, which means 21% of the variability in grade was explained by work_hours and female, and their interaction.

To report this in APA format, you would say:

To explain variation in participants’ grades we used multiple regression and included the following predictors: number of working hours, gender, and the interaction of those variables. This model explained 21% of the variance in grades (adjusted \(R^2\) = 0.21).

The adj.r.squared value stands for ‘adjusted \(R^2\)’. The adjustment is made because as you add more predictors to a model it will always explain more variability — just by chance (and if you add too many predictors you risk over-fitting the data, as we saw in the last session).

The adjustment reduces \(R^2\) to account for over-fitting (to some degree).
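
For reference (you don’t need to memorise this), the standard adjustment is:

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]

where \(n\) is the number of observations and \(p\) is the number of predictors in the model.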

  • The gap between regular and adjusted \(R^2\) will tend to grow as you add more predictors.

  • You should always use adj.r.squared in any reports or publications evaluating your model. Some people like to report both unadjusted and adjusted figures. I don’t mind, but I would always want to see at least the adjusted \(R^2\).

Run a model using the development dataset in which:

  • Life expectancy is the outcome.

  • GDP (national income) is the predictor.

  • Calculate \(R^2\) for this model.

  • How much of the variance in life expectancy can be explained by GDP?

# run the model
lm(life_expectancy ~ gdp_per_capita, data=development) %>% 
  # calculate some model statistics
  glance() %>% 
  # only show first 2 columns of output (all we need)
  select(1:2)
# A tibble: 1 x 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.341         0.340

Consolidation

The exercises below are to check you understand the materials, and give examples to talk through in detail with TARAs/staff.

Staff won’t discuss your analysis and interpretation of the rmipx dataset explicitly, but we can answer questions about your work on these datasets.

Use the fuel data to make this plot, and describe in words the pattern of the results:

  • What is the slope for the effect of engine size?

  • Why is this number so small when the slope on the graph looks fairly steep?

The coefficient for engine_size tells us how many additional MPG we expect for each additional ‘cc’ (cubic centimetre) of engine size. Each additional cc has a small effect, but engine_size in this sample ranges from 1170 to 7730, so it all adds up.

  • Does the engine_size slope apply to automatic or manual cars?

In this dataset manual cars are the ones where the automatic column is equal to FALSE. This makes them the ‘reference’ category.

You can see this from the names of the coefficients:


  • (Intercept)
  • engine_size
  • automaticTRUE
  • engine_size:automaticTRUE


The first slope coefficient is just called engine_size. The second is engine_size:automaticTRUE, so we can infer that the first is for cars where automatic = FALSE.

  • What is the slope for engine_size for automatic cars?

Remember that to get the slope for automatic cars you need to add both slope coefficients together.
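
The code that fitted this model isn’t shown above, but assuming it was created along the lines below (the name m3 matches the code used in the next answer, and the formula is inferred from the coefficient names listed earlier), you can add the two slope coefficients like this:

# fit mpg from engine size, transmission type, and their interaction
m3 <- lm(mpg ~ engine_size * automatic, data = fuel)

# slope for automatic cars = slope for manual cars + the difference for automatic cars
coef(m3)["engine_size"] + coef(m3)["engine_size:automaticTRUE"]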

  • How much variance does the model explain?

Around 77% of the variation in MPG can be explained by engine size and transmission type:

m3 %>% 
  glance() %>% 
  # only show first 2 cols
  select(1:2) %>% 
  # round the results to 2dp
  round(2)
# A tibble: 1 x 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1      0.79          0.77

Use the funimagery data to make this plot:

funimagery %>% 
  ggplot(aes(kg1, weight_lost_end_trt, color=intervention)) + 
  geom_point() + 
  geom_smooth(method=lm)

  • Describe the pattern of results you see

The plot suggests that participants who were heavier at baseline tended to lose more weight at the end of FIT treatment (functional imagery training), but this wasn’t the case for MI (motivational interviewing).

  • Fit a linear model to predict weight lost at the end of treatment from i) weight at the start of treatment and ii) intervention group, including the interaction term.

lm(weight_lost_end_trt ~ kg1 * intervention, data=funimagery)

Call:
lm(formula = weight_lost_end_trt ~ kg1 * intervention, data = funimagery)

Coefficients:
        (Intercept)                  kg1      interventionFIT  kg1:interventionFIT  
            -5.6391               0.0489               8.5799              -0.1367  
  • What is the difference in the slope for baseline weight between the two intervention groups?

The coefficient labelled kg1:interventionFIT gives the difference in slopes between the intervention groups (here, about -0.14). As before, to find the slope for the FIT group itself you would add the kg1 and kg1:interventionFIT coefficients together.