Overview
In this workshop we introduce multiple regression, that is: regression models which include 2 or more predictors. Multiple regression is a core technique in psychology and most other quantitative sciences.
In this workshop we focus on the case where you have one continuous outcome variable and two continuous predictor variables. The notes emphasise the connection between graphical models, which represent hypotheses about how the data were generated, and specific linear models. To evaluate multiple regression models we can check their residuals, and to compare models we can use a Bayes Factor: both techniques are demonstrated.
Learning outcomes
- Add multiple predictors to a linear model
- Calculate and interpret how much variation a model explains using \(R^2\)
- Calculate a Bayes Factor for a model to compare hypotheses
Using multiple regression
We often have multiple possible predictors of an outcome variable. We want to run models which use more than one predictor. We also need to compare different models to see which is the best.
This first video explains how to fit multiple regression models which include more than one predictor.
# load libraries
library(psydata)
library(tidyverse)
# check data we will use
fuel %>% glimpse
# make some illustrative plots
fuel %>% ggplot(aes(weight, mpg)) + geom_point()
fuel %>% ggplot(aes(cyl, mpg)) + geom_point()
fuel %>% ggplot(aes(cyl, weight)) + geom_point()
# run and save linear models to match the possible hypotheses
modelA <- lm(mpg ~ weight, data=fuel)
modelB <- lm(mpg ~ weight + cyl, data=fuel)
# look at the coefficients from each model
modelA
modelB
This video provides an overview of multiple regression, and does a rapid walk-through of fitting two models which represent different hypotheses about how the fuel dataset was generated.
In the rest of this workshop we break down the techniques used, and so the code is not explained in detail in this transcript; work through the explanations and examples below for full details.
However, some key ideas to take away are:
- Multiple regression includes more than one predictor of an outcome
- By comparing models with different predictors we are testing different hypotheses about the data
In the example, there are three variables in the fuel dataset that we use, recording attributes of various cars: weight (the car’s weight in kg), cyl (the number of cylinders the engine has), and mpg (how many miles per gallon the car can drive).
Before modelling, I plot the data and add a best-fit line to the scatterplot.
I run a model where mpg is predicted by weight, which I call modelA.
I then add cyl to the model formula, to say that mpg is predicted by weight and cyl. The formula is mpg ~ weight + cyl.
In the video we check the output of the model and see that there is an extra coefficient (more on this in the notes below).
We notice also that the coefficient for weight changes when we add cyl. This is also to be expected and is explained further below.
Fuel Economy Example
The fuel dataset contains information about a selection of cars. Our task: we want to understand the relationship between fuel economy, the weight of a car, and how many cylinders it has.
We can plot the data to give us some clues. First, we see that as car weight goes up, fuel economy goes down:
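A sketch of code that could produce this kind of plot, using geom_smooth() to add a best-fit line (assuming psydata and tidyverse are loaded, as above):

# scatterplot of weight against fuel economy, with a linear best-fit line
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)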
We also see that as the number of cylinders goes up, economy goes down:
We might conclude that both weight and the number of cylinders are important. But we need to be careful! If we plot weight against the number of cylinders, we can see that these predictors are also related.
Our problem: We don’t know which of these predictors (weight or number of cylinders) are important. It’s possible that confounding is occurring and one of these relationships is spurious.
Choosing between hypotheses
We want to distinguish between (at least) two hypotheses. We can represent them visually like this:
- Hypothesis A: Weight predicts fuel economy, and heavier cars tend to have more cylinders (weight and number of cylinders are correlated). But the number of cylinders doesn’t actually matter for economy. Knowing the number of cylinders won’t help us predict economy better, once we know the weight of the car.
- Hypothesis B: Both weight and the number of cylinders are independent predictors of fuel economy. Both provide useful information, and we need to use both variables to make the best possible prediction of fuel economy.
Adding a second predictor
We can think of linear models as representing different hypotheses. In the diagrams above:
- Hypothesis A is a model with a single predictor of fuel economy (weight)
- Hypothesis B is a model with two predictors of fuel economy (weight and number of cylinders, cyl)
Our task is to:
- Fit both models
- Quantify how well each one fits the data
- Compare them, and choose the simplest model which fits the data well
In session 5 we saw how to define the outcome and predictor variables for a simple regression. For example:
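# simple regression with a single predictor, as in session 5
modelA <- lm(mpg ~ weight, data = fuel)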
To add a second predictor to our lm model we just use the + symbol. For example:
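# add cyl as a second predictor, separated from weight by +
modelB <- lm(mpg ~ weight + cyl, data = fuel)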
Explanation of the code: To add a second predictor to our model we used the + symbol to separate the names of the two predictors: weight + cyl. This makes R estimate coefficients for both predictor variables.
Additional coefficients in the output
If we look at the output from modelA and modelB, we can see that model A’s output looks similar to the models we ran before:
Call:
lm(formula = mpg ~ weight, data = fuel)
Coefficients:
(Intercept) weight
37.28714 -0.01178
As before we have two coefficients:
- intercept, which represents the height of the regression line when the predictors equal zero
- weight, which represents the steepness of the slope (i.e. how strongly weight predicts economy)
However, the output for model B differs slightly:
Call:
lm(formula = mpg ~ weight + cyl, data = fuel)
Coefficients:
(Intercept) weight cyl
39.687037 -0.007037 -1.507643
We have the intercept, and the slope for weight, as before. But there is an extra coefficient, for cyl.
This is the change in the outcome we expect for each extra cylinder the engine has, i.e. for each unit increase in the cyl variable.
Making predictions with 2 predictors
If we want to make predictions, we now have to use both weight and cyl, but the process is very similar. Imagine we have a car with 4 cylinders, which weighs 3000kg (so weight = 3000).
Our prediction is:
\(\text{intercept} + (3000 \times \text{weight coefficient}) + (4 \times \text{cyl coefficient})\)
So, in this case, our prediction uses these values:
- The intercept is 39
- The slope for weight is -.007
- The slope for cyl is -1.5
So the calculation is \(39 + 3000 \times -.007 + 4 \times -1.5 = 12\) (allowing for rounding errors).
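If you want to avoid rounding errors, you can do the same calculation in R using the coefficients stored in the fitted model. A minimal sketch, using base R's coef() function and the modelB saved above:

# extract the (unrounded) coefficients from modelB and reproduce the prediction
b <- coef(modelB)
b["(Intercept)"] + 3000 * b["weight"] + 4 * b["cyl"]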
We could also use augment to make the prediction, as we did in the previous session:
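A sketch of how this could be done, assuming the broom package (which provides augment) is loaded and modelB has been saved as above:

# predict mpg for a hypothetical car weighing 3000kg with 4 cylinders
library(broom)
augment(modelB, newdata = tibble(weight = 3000, cyl = 4))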
augment predicts 12.5, but this is because it did not round the coefficients before doing the multiplication. The result is approximately the same (and close enough).
Understanding changes in the coefficients
Coefficients can change when adding new predictors.
You might notice that the weight coefficient has changed between the two models: in model A it is -0.012, but in model B it is -0.007.
The coefficients are the change in the outcome for a 1-unit change in the predictor, if all the other predictors stay at the same value.
We describe this as the effect of additional weight, adjusting for how many cylinders the car has.
So, in this example, once we have adjusted for the number of cylinders, the remaining (or marginal) effect of weight is reduced, although not eliminated.
It can help to imagine what would happen if we fit a regression:
- For the whole sample, ignoring the number of cylinders (like in model A)
- Separately for cars of 3, 4, 5 … cylinders.
The plot below shows what this would look like:
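A sketch of code that could draw a plot like this, fitting one overall line (black, ignoring cyl) and separate lines within each number of cylinders (coloured):

# overall best-fit line in black, plus separate best-fit lines per cyl group
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  geom_smooth(aes(colour = factor(cyl)), method = "lm", se = FALSE)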
The coefficient for weight in model A is equivalent to the black line. The coefficient for weight in model B is approximately equivalent to the average of the red, green and blue slopes.
We can see that the average slope of the red, green and blue lines is less steep than the black line. That is, once we have accounted for the number of cylinders, our estimate of the relationship between weight and economy is reduced.
Evaluating models
How do we know which of the two models is the best?
One way is to check the residuals from each model (and plot them). Smaller residuals are better.
The \(R^2\) statistic formalises this somewhat. It is a number from 0 (the model doesn’t predict the outcome at all) to 1 (the model predicts the outcome perfectly).
Always report the ‘adjusted \(R^2\)’.
We saw previously that residuals represent the distance between the predicted and observed data. Larger residuals mean bigger errors in prediction, and smaller errors are better.
The plot below shows the distribution of residuals from models A and B.
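A sketch of one way such a plot could be made, assuming modelA and modelB have been saved as above (residuals() extracts the residuals from a fitted model):

# combine the residuals from both models and plot their distributions
bind_rows(
  tibble(model = "Model A", residual = residuals(modelA)),
  tibble(model = "Model B", residual = residuals(modelB))
) %>%
  ggplot(aes(residual)) +
  geom_histogram(bins = 10) +
  facet_wrap(~model)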
Based on the distribution of the residuals, which model seems to make the best predictions?
Model B (representing Hypothesis B) makes better predictions, because more of the residuals are clustered around zero. That is, the error in the predictions is smaller.
R-squared (\(R^2\))
\(R^2\) is the proportion of variance in the outcome explained by the predictor variables.
\(R^2\) summarises how large the residuals are for different models, relative to the residual from a model without any predictors.
It ranges from 0 (models which are very poor and have large residuals), to 1 (models with perfect prediction and zero residuals). Because larger \(R^2\) values are better, we can use it to compare models with additional predictors.
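In symbols, if \(SS_{residual}\) is the sum of the squared residuals from our model, and \(SS_{total}\) is the sum of squared residuals from a model with no predictors (one which just predicts the mean of the outcome), then:

\[R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\]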
To calculate \(R^2\) we use the glance command on our saved model result:
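For example (glance is provided by the broom package):

# calculate R-squared and related statistics for model A
library(broom)
glance(modelA)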
Explanation of the code: We use the glance function, providing our saved result for modelA as the input.
Explanation of the output: glance produces a dataframe as a result, containing columns for r.squared, adj.r.squared, and some other statistics which we don’t need for now. The number to focus on here is adj.r.squared, which is 0.74.
Why use the adjusted \(R^2\), and what is the adjustment?
Unadjusted, \(R^2\) becomes increasingly over-optimistic as we add extra predictors to our model. The adjusted \(R^2\) figure corrects for this, reducing the raw \(R^2\) as we add extra predictors.
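For reference, the usual adjustment is:

\[R^2_{adjusted} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\]

where \(n\) is the number of observations and \(p\) is the number of predictors; the penalty grows as more predictors are added.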
Always report the adjusted \(R^2\) value.
For more detail, see Field’s (2012) textbook, Discovering Statistics Using R, p. 273.
We can repeat the process for modelB:
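# calculate R-squared and related statistics for model B
glance(modelB)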
Explanation of the output: This time we calculated adj.r.squared for modelB, which included cyl as an extra predictor. Adjusted \(R^2\) has now gone up to 0.82.
To summarise the results above:
- We predicted fuel economy using weight alone (model A) and adjusted \(R^2\) was 0.75.
- We added cyl (number of cylinders) in model B. Adjusted \(R^2\) increased to 0.82.
Models with higher \(R^2\) have smaller residuals and make better predictions. This means we should generally prefer model B.
But, if the increase in \(R^2\) is only small we might worry that the difference is due to chance — for example, because of random variations in our dataset. The final step is to quantify the evidence for each hypothesis.
Strength of the evidence
Our final task is to quantify how much evidence we have that model B is better than model A. For this we use a Bayes Factor.
As we saw in session 5, a Bayes Factor always compares two hypotheses. It tells us how much more likely one hypothesis is than an alternative hypothesis, given the data we collected.
We will use Bayes Factors to compare hypotheses A and B, which are represented by model A and model B.
To calculate the Bayes Factor we need to:
- Load the BayesFactor package
- Re-run our models using the lmBF function
- Divide model B by model A
The complete code is as follows:
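A sketch of the full sequence (each step is explained in turn below):

# load the BayesFactor package
library(BayesFactor)
# re-run the models using lmBF (same formulas as before)
mA <- lmBF(mpg ~ weight, data=fuel)
mB <- lmBF(mpg ~ weight + cyl, data=fuel)
# divide model B by model A to calculate the Bayes Factor
mB / mA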
Next we re-run the models using the lmBF function. The only change is to replace lm with lmBF.
# save model A
mA <- lmBF(mpg ~ weight, data=fuel)
# save model B
mB <- lmBF(mpg ~ weight + cyl, data=fuel)
Finally, we divide model B by model A to calculate the Bayes Factor.
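# the Bayes Factor comparing model B with model A
mB / mA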
Bayes factor analysis
--------------
[1] weight + cyl : 22.7057 ±0.01%
Against denominator:
mpg ~ weight
---
Bayes factor type: BFlinearModel, JZS
Explanation of the code: We divide mB by mA using the / symbol.
Explanation of the output: Our code produces a Bayes Factor showing the evidence in favour of model B. The Bayes Factor is 22.71.
Interpreting the Bayes Factor
In this example the Bayes Factor is 22.71, so we have strong evidence in favour of model B, compared with model A.
Put differently, model B is about 22 times more likely than model A, given the data we collected.
Interpreting Bayes Factors
We previously said that:
- BFs > 3 are moderate evidence for a hypothesis, and
- BF < 0.33 (\(\tfrac{1}{3}\)) are moderate evidence against a hypothesis.
Psychologists love a “line in the sand”, so a convention has emerged that we believe an effect is ‘real’ if the Bayes Factor is greater than 3, and believe there isn’t a ‘real’ effect if the Bayes Factor is less than 0.33.
In fact though, Bayes Factors are a continuous measure of the evidence for or against a particular model (hypothesis).
Several authors, including Kass & Raftery (1995), have written guidelines on how to interpret Bayes Factors in this way. They provide labels for Bayes Factors of various sizes. This is more useful than a simple “line in the sand”, especially when comparing multiple studies.
Bayes Factor | Interpretation
---|---
> 100 | Extreme evidence for the hypothesis tested
30 - 100 | Very strong evidence for the hypothesis tested
10 - 30 | Strong evidence for the hypothesis tested
3 - 10 | Moderate evidence for the hypothesis tested
1 - 3 | Anecdotal evidence for the hypothesis tested
Bayes Factors can also be smaller than 1, but these indicate support for the null hypothesis. If this is the case, we often find it helpful to reverse the order of division so that the BF is larger than 1 and state explicitly in the text that the BF we are reporting indicates the support for the null hypothesis.
Imagine we are comparing two regression models, A and B. In the R output we see that the Bayes Factor for model A is 22, and the Bayes Factor for model B is 56.
- Which model is more probable?
- The BF in favour of model B is:
- The BF in favour of model A is:
For the BF in favour of model A, we calculate \(BF_{A} / BF_{B}\) = 22/56 = 0.4
For the BF in favour of model B, we calculate \(BF_{B} / BF_{A}\) = 56/22 = 2.5
Note that \(1 / 0.4 = 2.5\). We can use this trick (taking the reciprocal) to ‘reverse’ any Bayes Factor that is < 1 so that it expresses evidence in favour of the other hypothesis.
Overall synopsis
In the example above our outcome variable was fuel economy and we had two potential predictor variables: the car’s weight, and the number of cylinders it had.
We ran two models (A and B) which represented different hypotheses:
- Hypothesis A said only weight was needed to predict fuel economy.
- Hypothesis B said that both weight and cylinders would be independent predictors of fuel economy.
After fitting both models using lm we calculated \(R^2\) to estimate how well each model fitted the data.
\(R^2_{adjusted}\) was .75 for model A, and .82 for model B, which means both models were quite good. Both explained a large proportion of the variability in fuel economy.
To compare the models, we re-fitted them using lmBF and calculated a Bayes Factor. This told us that, although both were quite a good fit to the data, we had 22 times more evidence for model B compared with model A.
From this example we conclude that both weight and cylinders are independent predictors of fuel economy. That is, both variables make a useful contribution when predicting fuel economy.
Check your knowledge
- We prefer models with … residuals
- The \(R^2\) statistic ranges from …
- It’s better for a model to have a … \(R^2\) value
- If we calculate the Bayes Factor for modelA / modelB and the result is > 1, this provides evidence that …
- A Bayes Factor of 25 means we have … evidence
- A Bayes Factor < 1 means we have …
Additional examples and exercises
To consolidate what we have learned this section asks you to follow the same process using a different dataset.
Student study habits
The studyhabits dataset (in the psydata package) contains data on students and their study habits:
- work_hours: How many hours per week the student worked
- work_consistently: A measure of how consistently the student worked each day
- grade: Their assessment grade (%)
The tasks below use these data to pose a series of questions to check your understanding. They ask you to run additional analyses to build confidence using R.
Task 1: Interpret these plots
Based on these plots of the studyhabits data:
- Hours worked and assessment grades are …
- Working consistently is likely to produce … grades
- Students who work more consistently tend to work for … time overall
Task 2: Describe alternative hypotheses
Imagine we are interested in the effect of working consistently each day on grades, adjusting for the total number of hours studied. That is, we want to know if working consistently helps predict grades, even after we know the total hours spent studying.
A. Fill in the blanks in the following diagram:
B. Which hypothesis do you think is more likely, based on the plots and your own experience?
Task 3: Write code for the models
What is the correct code to run models for each of the hypotheses we described above?
Hypothesis A:
Hypothesis B:
Task 4: Interpret the Bayes Factor
Imagine you have already run the following code:
modA <- lmBF(grade ~ work_hours, data=studyhabits)
modB <- lmBF(grade ~ work_hours + work_consistently, data=studyhabits)
- The command to calculate and display the Bayes Factor in favour of Model B is:
Task 5: Interpret the Bayes Factor output
Imagine you had run the code below:
modA <- lmBF(grade ~ work_hours, data=studyhabits)
modB <- lmBF(grade ~ work_hours + work_consistently, data=studyhabits)
Interpret the following output:
Bayes factor analysis
--------------
[1] work_hours + work_consistently : 6.627368e+13 ±0%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
- We can interpret the Bayes Factor in this output as saying:
- The Bayes Factor for model B is:
In the section above we only look at the output of modB. The default in the BayesFactor package is to give the BF for the comparison against a model with no predictors. This is sometimes what you want, but normally you will have to divide one model by another to get the correct Bayes Factor.
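For example, using the models saved in the tasks above, the BF for adding work_consistently (over and above work_hours) would be calculated like this:

# compare model B (both predictors) against model A (work_hours only)
modB / modA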
The BF number is written in scientific notation as 6.627368e+13. This means it is roughly \(6.627 \times 10^{13}\), so about 66,273,680,000,000. Always double-check when you see an e in a number in R output.
Useful: https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php
National happiness
This last exercise is not required; it is offered as an additional challenge for students who have time, or who wish to practice and consolidate skills learned.
The plots below show the distribution of, and relationship between, three variables in the happy dataset, in the psydata package.
These data are a simplified version of the Gallup World Happiness report (and see more details on the questions asked here).
The variables shown are the national average of responses to three questions:
- happiness: a measure of subjective well-being, rated on a 1-10 scale
- social_support: responses to the question: "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
- freedom_to_make_life_choices: responses to "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
Using the happy data:
- Write code to run linear models to predict happiness using the ‘social support’ and ‘freedom’ variables.
- Draw diagrams to describe two alternative hypotheses, similar to those used above.
- Calculate a Bayes Factor which compares those two alternative hypotheses. Which hypothesis is more likely?
There is no single set of ‘correct’ answers to these questions. If you would like feedback on your work, speak to a member of staff during the workshops.