In brief
Multiple regression is a technique which can describe the relationship between one outcome and two or more predictors.
We can also use multiple regression to describe cases where two variables interact. That is, when the effect of one predictor is increased or decreased by another (this is called moderation, as we saw in the first session).
Multiple regression is important because it allows us to make more realistic models and better predictions. But — like any sharp tool — it should be used carefully. If our statistical model doesn’t match the underlying network of causes and effects, or if we have used a biased sample, we can be misled.
When evaluating our models we can ask two useful questions: First, how much of the variability in the outcome do my predictors explain? The \(R^2\) statistic answers this. Secondly: does the model make better predictions than just taking the average outcome? For this we can compute a Bayes Factor.
Before you start
- Make a new Rmd file to save your work in this session. Call it regression-multiple.rmd. Make sure you save it in the same directory (folder) as your data handling and regression rmd files (something like rmip_2021).
Causal diagrams
Think back to our first session on causes and effects. When we drew causal diagrams of our research question we found cases where there were:
- Multiple causes of a single outcome, and where
- One variable might alter the effect of another
We drew diagrams like this for those cases:
Or to give a concrete example:
When we say that effect modification (i.e. moderation) is taking place, this is really what we mean:
Benefits of multiple regression
When we have a causal diagram involving multiple variables there are three key benefits to using multiple regression:
- Reducing variance/improving prediction: adding variables can increase the precision and the statistical power of our predictions.
- Testing moderation or interactions: where the relationship between two variables might be changed by a third variable we can test for this using multiple regression (or Anova; as you’ll see later, Anova is just a special kind of linear model).
- ‘Controlling for’ bias in observational data: used wisely, in conjunction with causal diagrams, multiple regression can help estimate the causal relations between variables using observational data (although we need to be very careful as we do this).
You can read more about each of these benefits here. However, our focus in using multiple regression in this module is to test for moderation.
Testing moderation
Moderation is when the effect of one predictor on the outcome depends on the level of another predictor: that is, the two predictors interact.
If you have a moderation hypothesis (for example, that the relationship between a predictor and the outcome might be different between groups) the first thing you should do is plot the data.
As an example, I use the studyhabits dataset from psydata, which we used previously. We already know there is a link (in these data) between work hours and grades because we can see it in the plot (and we modelled it using lm in the first regression session):
library(psydata)
studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_jitter() +
  geom_smooth(method=lm, se=F)
Interaction plots
We could also ask: “is this relationship the same for men and women?” To visualise the differences we can use colour:
studyhabits %>%
  ggplot(aes(work_hours, grade, color=female)) +
  geom_jitter() +
  geom_smooth(method="lm", se=F)
Explanation of the code and output: I added color=female to the ggplot function. This splits the data by gender. We can see that the slopes for men and women do look different.
This type of plot shows the difference in slopes between two (or more) categories and is called an ‘interaction plot’.
- Use the code above to reproduce the coloured scatter-plot.
- Work with a partner. Agree together:
- What is the overall pattern of these simulated results?
- Does extra time spent working benefit men and women equally?
- Adapt the code to produce another plot with a different variable on the x axis. Are the slopes for men and women the same or different this time?
# use `work_consistently` as the x axis
studyhabits %>%
  ggplot(aes(work_consistently, grade, color=female)) +
  geom_jitter() +
  geom_smooth(method="lm", se=F)
Using lm with two predictors
To add more predictors to our model we can simply add them with a “+” symbol. For example:
lm(grade ~ work_hours + female, data = studyhabits)
Call:
lm(formula = grade ~ work_hours + female, data = studyhabits)
Coefficients:
(Intercept) work_hours femaleTRUE
38.8234 0.8719 -1.5761
This model now has three coefficients: (Intercept), work_hours, and femaleTRUE.
These three coefficients describe two parallel sloped lines:
- The steepness of both lines (they share a single slope): work_hours
- The place on the y axis this slope starts from: (Intercept)
- The difference in the position of the two lines, which is constant across the x axis: femaleTRUE
The plot this model represents looks like this:
Explanation: Adding female to the model allows the intercepts to vary between men and women, but the slopes are still fixed to the same value. The lines on the plot look a slightly better fit, but not much.
We need to allow for an interaction if we want our model to match the data.
Adding an interaction
To add an interaction we need to replace the “+” symbol with a “*” in our formula:
lm(grade ~ work_hours * female, data = studyhabits)
Call:
lm(formula = grade ~ work_hours * female, data = studyhabits)
Coefficients:
(Intercept) work_hours femaleTRUE work_hours:femaleTRUE
18.623 1.795 22.463 -1.074
Explanation: This time we have changed the formula and:
- added female as a second predictor, and
- used the * symbol between work_hours and female; this allows the slope for work_hours to be different for men and women.
Description of the output: This time, we have 4 coefficients shown:
- (Intercept)
- work_hours
- femaleTRUE
- work_hours:femaleTRUE
Interpreting the coefficients: These coefficients have changed their meaning from the first model we ran. However, we can still think of them as either points or slopes on the graph with fitted lines overlaid:
- (Intercept) is the point for men, where work_hours = 0. When female = FALSE this is a man, so we are looking at the red line, where it crosses zero on the x axis. Find this point on the plot now.
- femaleTRUE is the difference between men and women, when work_hours = 0 (the difference between the blue and red lines, at the zero point on the x axis).
- work_hours is the slope (relationship) between work_hours and grade for men (the steepness of the red line).
- work_hours:femaleTRUE is the difference in slopes for work hours, for women. So this is the slope for women minus the slope for men: that is, the difference in steepness between the red and blue lines. It’s NOT the slope of the blue line. We need to add both slope coefficients to see that.
Compare the model output and plot which are repeated below:
Call:
lm(formula = grade ~ work_hours * female, data = studyhabits)
Coefficients:
(Intercept) work_hours femaleTRUE work_hours:femaleTRUE
18.623 1.795 22.463 -1.074
In pairs:
- For each of the 4 coefficients, agree if it represents a point or a slope
- Find each of the points on the plot (i.e. which coefficient is it)
- Compare the slope coefficients to the lines on the plot - can you explain which coefficient describes which slope?
- What would happen if the sign of each coefficient was reversed? E.g. if the coefficient was now a minus number? What would this mean for the plot?
Double check you understand how to interpret the work_hours:femaleTRUE coefficient. It’s very common for regression coefficients to represent differences in this way. But in this example it does mean we have to know both work_hours (the slope for men) and work_hours:femaleTRUE (the difference in slopes for men and women) to be able to work out the slope for women.
To test your knowledge: What is the slope for women in this model?
The answer is 0.72.
This is because the data is coded such that the work_hours
slope is the coefficient for men. To get the slope for women we have to add:
- the slope for men (1.79)
- the difference in slopes between men and women (-1.07)
So the slope for women is: 1.79 + -1.07 = 0.72.
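You can get R to do this sum for you by pulling the coefficients out of the fitted model. A minimal sketch (m is just an illustrative name for the stored model):
m <- lm(grade ~ work_hours * female, data = studyhabits)
# slope for women = slope for men + difference in slopes
coef(m)["work_hours"] + coef(m)["work_hours:femaleTRUE"]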
Prediction
This section is an optional extension exercise. It is not required for the PSYC519/719 assessment, but would be useful if you are planning on using these methods in your own research.
(If you haven’t done so already, run the code below and make sure you have a variable called third.model stored in your Environment pane in RStudio):
third.model <- lm(grade ~ work_hours * female, data = studyhabits)
Because there are now 4 coefficients in this third model, it’s more fiddly to make predictions by hand.
R’s augment function (in the broom library) will do it for us. By default, augment uses the regression model (the fitted lines) to make a prediction for each row in the original dataset. So, for example, we can say:
library(broom)
augment(third.model) %>%
# only show the first 4 columns
select(1:4) %>%
# and the first 6 rows
head(6)
# A tibble: 6 x 4
grade work_hours female .fitted
<dbl> <dbl> <lgl> <dbl>
1 65 26 TRUE 59.8
2 57 26 TRUE 59.8
3 45 16 FALSE 47.3
4 70 33 TRUE 64.9
5 55 25 TRUE 59.1
6 50 20 TRUE 55.5
Explanation of the code:
- We used augment to make predictions for all the observations in our dataset.
- We used select(1:4) to show only the first 4 columns.
- We used head(6) to show only the first 6 rows.
Explanation of the output: R has made a prediction for grade for each person in the dataset, based on the coefficients from the model. Because we used head(6) we only see the first 6 predictions (see what happens without it for yourself).
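To check your understanding of what augment is doing, you can reproduce the first fitted value by hand from the coefficients shown earlier (row 1 is a woman, female = TRUE, who works 26 hours):
# intercept + offset for women + (slope for men + difference in slopes) * 26 hours
18.623 + 22.463 + (1.795 + -1.074) * 26
# = 59.8, matching the .fitted value above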
Predicting for new data
Often, we don’t want predictions for the data in our sample: instead we want to make predictions at particular values of the predictor variables. For example, we might want to make predictions for the typical person, or specifically for a woman with a high or low number of work_hours.
To do this we need to:
- Create a new dataset where the predictors are set to the values we’re interested in.
- Give this new data to the augment function, as well as the fitted model.
We can make a new dataset like this:
typicalpeople <- tibble(female=c(TRUE,TRUE), work_hours=c(10, 20))
Explanation of the code: We use tibble to make a new dataframe. The c() command is short for ‘combine’: it combines the values given into a vector. So, this code creates a new dataframe with 2 rows, where female is always TRUE, and where work_hours is either 10 or 20.
We should check we got what we expected by looking at the new dataframe:
typicalpeople
# A tibble: 2 x 2
female work_hours
<lgl> <dbl>
1 TRUE 10
2 TRUE 20
Then we can send this new dataframe to augment again:
augment(third.model, newdata = typicalpeople)
# A tibble: 2 x 4
female work_hours .fitted .se.fit
<lgl> <dbl> <dbl> <dbl>
1 TRUE 10 48.3 1.83
2 TRUE 20 55.5 0.821
Explanation: We made a new dataframe containing only 2 rows. We passed this to the augment function along with our regression model (third.model). We got two predictions back: one for a woman who works 10 hours, and the other for a woman who works 20 hours. The .fitted column provides the prediction.
Now calculate predictions for:
- a man who does 35 hours of work each week
- a woman who does 22.5 hours of work each week
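If you get stuck, the sketch below shows one approach, reusing the augment pattern from above (new_people is just an illustrative name):
# a man working 35 hours, and a woman working 22.5 hours
new_people <- tibble(female = c(FALSE, TRUE), work_hours = c(35, 22.5))
augment(third.model, newdata = new_people)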
Additional (trickier) extension exercise:
- Make a plot from a set of predictions
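One possible approach for the plotting exercise is sketched below. The grid of work_hours values and the plot styling are just illustrative choices; this assumes the tidyverse (for expand_grid and ggplot) is loaded:
# make predictions across a range of work hours, for men and women
prediction_grid <- expand_grid(work_hours = seq(10, 40, 2), female = c(TRUE, FALSE))
augment(third.model, newdata = prediction_grid) %>%
  ggplot(aes(work_hours, .fitted, color = female)) +
  geom_line()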
How good?
In describing how we fit lines to data, I explained that lm attempts to minimise the squared error of the predictions it makes.
For a single-predictor model this was quite easy to visualise:
R fits a line which minimises the area of the squared errors (shown on the right in blue).
We can describe how good the model is at making predictions using the relative sizes of these red and blue squares.
- Red squares: the variance in the outcome variable, \(y\)
- Blue squares: the residual or unexplained variance, after fitting the model
The \(R^2\) statistic describes the proportion of the red squares eliminated by fitting the model, and this also works for models which have more than 1 predictor.
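We can compute this proportion by hand for the single-predictor model, as a check on our understanding (a sketch; first.model is just an illustrative name, and this assumes psydata is loaded):
first.model <- lm(grade ~ work_hours, data = studyhabits)
# total variability in the outcome (the red squares)
red_squares <- sum((studyhabits$grade - mean(studyhabits$grade))^2)
# residual variability after fitting the model (the blue squares)
blue_squares <- sum(residuals(first.model)^2)
# proportion of the red squares eliminated = R^2
1 - blue_squares / red_squares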
\(R^2\) (the statistic)
Note: the ‘R’ in \(R^2\) has nothing to do with the R software - it’s the name of the statistic—like a ‘t-test’ or a ‘p value’.
\(R^2\) tells us how much variance in the outcome is explained by the model.
Above, we ran a model to predict grades in the studyhabits dataset:
third.model <- lm(grade ~ work_hours * female, data = studyhabits)
To calculate \(R^2\) we can use the glance command in the broom package.
library(broom)
glance(third.model) %>%
# only show the first 2 columns
select(1:2)
# A tibble: 1 x 2
r.squared adj.r.squared
<dbl> <dbl>
1 0.221 0.213
Explanation of the output:
- First we loaded the broom package.
- Then, glance produced a dataframe of statistics relating to our model.
- For the moment we only need the first two columns: r.squared and adj.r.squared.
- For this model, adj.r.squared = 0.21, which means 21% of the variability in grade was explained by work_hours and female, and their interaction.
To report this in APA format, you would say:
To explain variation in participants’ grades we used multiple regression and included the following predictors: number of working hours, gender, and the interaction of those variables. This model explained 21% of the variance in grades (adjusted \(R^2\) = 0.21).
The adj.r.squared value stands for ‘adjusted \(R^2\)’. The adjustment is made because, as you add more predictors to a model, it will always explain more variability just by chance (and if you add too many predictors you risk over-fitting the data, as we saw in the last session).
The adjustment reduces \(R^2\) to account for over-fitting (to some degree).
The gap between regular and adjusted \(R^2\) will tend to grow as you add more predictors.
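For reference, the standard adjustment (the one lm reports) is:

\[R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\]

where \(n\) is the number of observations and \(p\) is the number of predictors. Because the \(\frac{n-1}{n-p-1}\) term grows as \(p\) increases, adding predictors widens the gap between the two statistics.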
You should always use adj.r.squared in any reports or publications evaluating your model. Some people like to report both unadjusted and adjusted figures. I don’t mind, but I would always want to see at least the adjusted \(R^2\).
Run a model using the development dataset in which:
- Life expectancy is the outcome.
- GDP (national income) is the predictor.
Calculate \(R^2\) for this model. How much of the variance in life expectancy can be explained by GDP?
# run the model
lm(life_expectancy ~ gdp_per_capita, data=development) %>%
# calculate some model statistics
glance() %>%
# only show first 2 columns of output (all we need)
select(1:2)
# A tibble: 1 x 2
r.squared adj.r.squared
<dbl> <dbl>
1 0.341 0.340
Consolidation
The exercises below are to check you understand the materials, and give examples to talk through in detail with TARAs/staff.
Staff won’t discuss your analysis and interpretation of the rmipx dataset explicitly, but we can answer questions about your work on these datasets.
Use the fuel data to make this plot, and describe in words the pattern of the results:
What is the slope for the effect of engine size?
Why is this number so small when the slope on the graph looks fairly steep?
The coefficient for engine_size tells us how many additional MPG we expect for each additional ‘cc’ (cubic centimeter) of engine size. Each additional cc has a small effect, but engine_size in this sample ranges from 1170 to 7730, so it all adds up.
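The fitted model these answers refer to isn’t shown above, but it was presumably something like the sketch below (stored as m3, the name used in the code further down; mpg as the outcome column name is an assumption based on the description):
# a sketch of the assumed model; `mpg` is an assumed column name in `fuel`
m3 <- lm(mpg ~ engine_size * automatic, data = fuel)
coef(m3)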
- Does the engine_size slope apply to automatic or manual cars?
In this dataset manual cars are the ones where the automatic column is equal to FALSE. This makes them the ‘reference’ category.
You can see this from the names of the coefficients:
- (Intercept)
- engine_size
- automaticTRUE
- engine_size:automaticTRUE
The first slope coefficient is just called engine_size. The second is engine_size:automaticTRUE, so we can infer that the first is for cars where automatic = FALSE.
- What is the slope for engine_size for automatic cars?
Remember that to get the slope for automatic cars you need to add both slope coefficients together.
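In code, using the assumed m3 model from above:
# slope for automatic cars = slope for manual cars + difference in slopes
coef(m3)["engine_size"] + coef(m3)["engine_size:automaticTRUE"]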
- How much variance does the model explain?
Around 77% of the variation in MPG can be explained by engine size and transmission type:
m3 %>%
  glance() %>%
  # only show first 2 cols
  select(1:2) %>%
  # round the results to 2dp
  round(2)
# A tibble: 1 x 2
r.squared adj.r.squared
<dbl> <dbl>
1 0.79 0.77
Use the funimagery data to make this plot:
funimagery %>%
  ggplot(aes(kg1, weight_lost_end_trt, color=intervention)) +
  geom_point() +
  geom_smooth(method=lm)
- Describe the pattern of results you see
The plot suggests that participants who were heavier at baseline tended to lose more weight at the end of FIT treatment (functional imagery training), but this wasn’t the case for MI (motivational interviewing).
- Fit a linear model to predict weight lost at the end of treatment from i) weight at the start of treatment and ii) intervention group, including the interaction term.
lm(weight_lost_end_trt ~ kg1 * intervention, data=funimagery)
Call:
lm(formula = weight_lost_end_trt ~ kg1 * intervention, data = funimagery)
Coefficients:
(Intercept) kg1 interventionFIT kg1:interventionFIT
-5.6391 0.0489 8.5799 -0.1367
- What is the difference in the slope for baseline weight between the two intervention groups?
The coefficient labelled kg1:interventionFIT gives the difference in slopes between the intervention groups.
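To extract that difference, and the slope for the FIT group itself, you could use a sketch like this (m4 is just an illustrative name):
m4 <- lm(weight_lost_end_trt ~ kg1 * intervention, data = funimagery)
coef(m4)["kg1:interventionFIT"]                    # difference in slopes (about -0.14)
coef(m4)["kg1"] + coef(m4)["kg1:interventionFIT"]  # slope for the FIT group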