Building models

Ben Whalley, Paul Sharpe, Sonja Heintz

Overview

In this workshop we introduce multiple regression, that is, regression models which include two or more predictors. Multiple regression is a core technique in psychology and most other quantitative sciences.

In this workshop we focus on the case where you have one continuous outcome variable and two continuous predictor variables. The notes emphasise the connection between graphical models, which represent hypotheses about how the data were generated, and specific linear models. To evaluate multiple regression models we can check their residuals, and to compare models we can use a Bayes Factor: both techniques are demonstrated.

Learning outcomes

  • Add multiple predictors to a linear model
  • Calculate and interpret how much variation a model explains using \(R^2\)
  • Calculate a Bayes Factor for a model to compare hypotheses

Using multiple regression

We often have multiple possible predictors of an outcome variable. We want to run models which use more than one predictor. We also need to compare different models to see which is the best.

This first video explains how to fit multiple regression models which include more than one predictor.

# load libraries
library(psydata)
library(tidyverse)
library(broom)      # provides glance() and augment(), used below


# check data we will use
fuel %>% glimpse


# make some illustrative plots
fuel %>% ggplot(aes(weight, mpg)) + geom_point()   # economy against weight
fuel %>% ggplot(aes(cyl, mpg)) + geom_point()      # economy against number of cylinders
fuel %>% ggplot(aes(cyl, weight)) + geom_point()   # are the two predictors themselves related?



# run and save linear models to match the possible hypotheses
modelA <- lm(mpg ~ weight, data=fuel)
modelB <- lm(mpg ~ weight + cyl, data=fuel)

# look at the coefficients from each model
modelA
modelB

This video provides an overview of multiple regression, and does a rapid walk-through of fitting two models which represent different hypotheses about how the fuel dataset was generated.

In the rest of this workshop we break down the techniques used, and so the code is not explained in detail in this transcript; work through the explanations and examples below for full details.

However, some key ideas to take away are:

  • Multiple regression includes more than one predictor of an outcome
  • By comparing models with different predictors we are testing different hypotheses about the data

In the example, there are three variables in the fuel dataset that we use, recording attributes of various cars: weight (the car’s weight in kg), cyl (the number of cylinders the engine has), and mpg (how many miles per gallon the car can drive).

Before modeling, I plot the data, and add a best-fit line to the scatterplot.
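In case you want to reproduce that step yourself, a minimal sketch is shown below (my addition; not necessarily the exact code used in the video). It adds a linear best-fit line with geom_smooth:

# scatterplot of weight against fuel economy, with a linear best-fit line
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)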

I run a model where mpg is predicted by weight, which I call modelA.

I then add cyl to the model formula, to say that mpg is predicted by weight and cyl. The formula is mpg ~ weight + cyl.

In the video we check the output of the model and see that there is an extra coefficient (more on this in the notes below).

We notice also that the coefficient for weight changes when we add cyl. This is also to be expected and is explained further below.

Fuel Economy Example

The fuel dataset contains information about a selection of cars. Our task: We want to understand the relationship between fuel economy, the weight of a car, and how many cylinders it has.

We can plot the data to give us some clues. First, we see that as car weight goes up, fuel economy goes down:

We also see that as the number of cylinders goes up, economy goes down:

We might conclude that both weight and the number of cylinders are important. But we need to be careful! If we plot weight against the number of cylinders, we can see that these predictors are also related.

Our problem: We don’t know which of these predictors (weight or number of cylinders) are important. It’s possible that confounding is occurring and one of these relationships is spurious.
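One quick way to see this overlap numerically is to correlate the two predictors (a small sketch added here for illustration):

# how strongly are the two predictors related to each other?
fuel %>%
  select(weight, cyl) %>%
  cor()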

Choosing between hypotheses

We want to distinguish between (at least) two hypotheses. We can represent them visually like this:

Hypothesis A: Weight predicts fuel economy, and heavier cars tend to have more cylinders (weight and number of cylinders are correlated). But the number of cylinders doesn’t actually matter for economy: knowing the number of cylinders won’t help us predict economy better, once we know the weight of the car.

Hypothesis B: Both weight and the number of cylinders are independent predictors of fuel economy. Both provide useful information, and we need to use both variables to make the best possible prediction of fuel economy.


Adding a second predictor

We can think of linear models as representing different hypotheses. In the diagrams above:

  • Hypothesis A is a model with a single predictor of fuel economy (weight)
  • Hypothesis B is a model with two predictors of fuel economy (weight and number of cylinders, cyl)


Our task is to:

  1. Fit both models
  2. Quantify how well each one fits the data
  3. Compare them, and choose the simplest model which fits the data well




In session 5 we saw how to define the outcome and predictor variables for a simple regression. For example:

modelA <- lm(mpg ~ weight, data=fuel)


To add a second predictor to our lm model we just use the + symbol. For example:

modelB <- lm(mpg ~ weight + cyl, data=fuel)

Explanation of the code: To add a second predictor to our model we used the + symbol to separate the names of the two predictors: weight + cyl. This makes R estimate coefficients for both predictor variables.

Additional coefficients in the output

If we look at the output from modelA and modelB we can see that Model A’s output looks similar to that of models we ran before:

modelA

Call:
lm(formula = mpg ~ weight, data = fuel)

Coefficients:
(Intercept)       weight  
   37.28714     -0.01178  

As before we have two coefficients:

  • intercept, which represents the height of the regression line when the predictors equal zero
  • weight, which represents the steepness of the slope (i.e. how strongly weight predicts economy)



However, the output for model B differs slightly:

modelB

Call:
lm(formula = mpg ~ weight + cyl, data = fuel)

Coefficients:
(Intercept)       weight          cyl  
  39.687037    -0.007037    -1.507643  

We have the intercept, and the slope for weight as before. But there is an extra coefficient, for cyl.

This is the change in the outcome we expect for each extra cylinder the engine has, i.e. for each unit increase in the cyl variable.

Making predictions with 2 predictors

If we want to make predictions, we now have to use both weight and cyl, but the process is very similar. Imagine we have a car with 4 cylinders, which weighs 3000kg (so weight=3000).

Our prediction is:

\(\text{intercept} + (\text{slope for weight} \times 3000) + (\text{slope for cyl} \times 4)\)

So, in this case, our prediction would be:

  • The intercept is 39
  • The slope for weight is -.007
  • The slope for cyl is -1.5

So the calculation is \(39 + 3000 \times -.007 + 4 \times -1.5 = 12\) (allowing for rounding errors).
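We can reproduce this calculation in R using the unrounded coefficients (a small sketch added here; coef() extracts the named coefficients from a fitted model):

# reproduce the prediction by hand, using coef() to avoid rounding errors
b <- coef(modelB)
b["(Intercept)"] + b["weight"] * 3000 + b["cyl"] * 4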

We could also use augment to make the prediction, as we did in the previous session:

# the .fitted column is the prediction
augment(modelB, newdata=tibble(weight=3000, cyl=4))

augment predicts 12.5; the small difference arises because augment uses the unrounded coefficients when doing the multiplication. The result is approximately the same, and close enough for our purposes.

Understanding changes in the coefficients

Coefficients can change when adding new predictors.

You might notice that the weight coefficient has changed between the two models. In model A it is -0.012, but in model B it is -0.007.

Why did it go down so much?

The coefficients are the change in the outcome for a 1 unit change in the predictor IF all the other predictors had stayed at the same value.

We describe this as the effect of additional weight, adjusting for how many cylinders the car had.


So, in this example, once we have adjusted for the number of cylinders, the remaining or marginal effect of weight is reduced (although not eliminated).



It can help to imagine what would happen if we fit a regression:

  1. For the whole sample, ignoring the number of cylinders (like in model A)
  2. Separately for cars with 4, 6 or 8 cylinders (the values present in the fuel data).


The plot below shows what this would look like:
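A minimal sketch of code that could produce a plot like this is shown below (my addition; the original figure may have been drawn differently):

# overall best-fit line (black), plus separate lines for each number of cylinders
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  geom_smooth(aes(colour = factor(cyl)), method = "lm", se = FALSE)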

The weight coefficient in model A is equivalent to the slope of the black line. The weight coefficient in model B is approximately equivalent to the average of the red, green and blue slopes.


We can see that the average slope of the red, green and blue lines is less steep than the black line. That is, once we have accounted for the number of cylinders, our estimate of the relationship between weight and economy is reduced.

Evaluating models

How do we know which of the two models is the best?

One way is to check the residuals from each model (and plot them). Smaller residuals are better.

The \(R^2\) statistic formalises this somewhat. It is a number from 0 (the model doesn’t predict anything) to 1 (the model predicts the outcome perfectly).

Always report the ‘adjusted \(R^2\)’.

# calculate R squared
glance(modelA)
glance(modelB)

We saw previously that residuals represent the distance between the predicted and observed data. Larger residuals mean bigger errors in prediction, and smaller errors are better.

The plot below shows the distribution of residuals from models A and B.

Based on the distribution of the residuals, which model seems to make the best predictions?

Model B (representing Hypothesis B) makes better predictions, because more of the residuals are clustered around zero. That is, the error in the predictions is smaller.
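If you want to draw a plot like this yourself, a minimal sketch (added here for illustration) uses augment() from the broom package to extract the residuals from each model:

# extract residuals from each model and compare their distributions
bind_rows(
  augment(modelA) %>% mutate(model = "Model A"),
  augment(modelB) %>% mutate(model = "Model B")
) %>%
  ggplot(aes(.resid)) +
  geom_histogram(bins = 20) +
  facet_wrap(~model)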

R-squared (\(R^2\))

\(R^2\) is the proportion of variance in the outcome explained by the predictor variables.

\(R^2\) summarises how large the residuals are for different models, relative to the residuals from a model without any predictors.

It ranges from 0 (models which are very poor and have large residuals), to 1 (models with perfect prediction and zero residuals). Because larger \(R^2\) values are better, we can use it to compare models with additional predictors.


To calculate \(R^2\) we use the glance command on our saved model result:

glance(modelA)

Explanation of the code: We use the glance function, providing our saved result for modelA as the input.

Explanation of the output: glance produces a dataframe as a result, containing columns for r.squared, adj.r.squared, and some other statistics which we don’t need for now. The number to focus on here is adj.r.squared, which is 0.74.
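If you only want the adjusted \(R^2\) value itself, you can pull that column out of the glance output directly (a small sketch added here):

# extract just the adjusted R-squared value
glance(modelA) %>% pull(adj.r.squared)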

Why use the adjusted \(R^2\), and what is the adjustment?

Unadjusted, \(R^2\) becomes increasingly over-optimistic as we add extra predictors to our model. The adjusted \(R^2\) figure corrects for this, reducing the raw \(R^2\) as we add extra predictors.

Always report the adjusted \(R^2\) value.

For more detail, see Field’s (2012) textbook, Discovering Statistics Using R, p. 273.



We can repeat the process for modelB:

glance(modelB)

Explanation of the output: This time we calculated adj.r.squared for modelB, which included cyl as an extra predictor. Adjusted \(R^2\) has now gone up to 0.82.



To summarise the results above:

  • We predicted fuel economy using weight alone (model A) and adjusted \(R^2\) was 0.74.
  • We added cyl (number of cylinders) in model B. Adjusted \(R^2\) increased to 0.82.

Models with higher \(R^2\) have smaller residuals and make better predictions. This means we should generally prefer model B.


But, if the increase in \(R^2\) is only small we might worry that the difference is due to chance — for example, because of random variations in our dataset. The final step is to quantify the evidence for each hypothesis.

Strength of the evidence

Our final task is to quantify how much evidence we have that model B is better than model A. For this we use a Bayes Factor.

library(BayesFactor)

modelA <- lmBF(mpg ~ weight, data=fuel)
modelB <- lmBF(mpg ~ weight + cyl, data=fuel)


modelB / modelA

As we saw in session 5, a Bayes Factor always compares two hypotheses. It tells us how much more likely one hypothesis is than an alternative hypothesis, given the data we collected.

We will use Bayes Factors to compare hypotheses A and B, which are represented by model A and model B.


To calculate the Bayes Factor we need to:

  • Load the BayesFactor package
  • Re-run our models using the lmBF function
  • Divide model B by model A


The complete code is as follows:

# load the package first
library(BayesFactor)

Next we re-run the models using the lmBF function. The only change is to replace lm with lmBF.

# save model A
mA <- lmBF(mpg ~ weight, data=fuel)

# save model B
mB <- lmBF(mpg ~ weight + cyl, data=fuel)

Finally, we divide model B by model A to calculate the Bayes Factor.

# evidence in favour of model B
mB / mA
Bayes factor analysis
--------------
[1] weight + cyl : 22.7057 ±0.01%

Against denominator:
  mpg ~ weight 
---
Bayes factor type: BFlinearModel, JZS

Explanation of the code: We divide mB by mA using the / symbol.

Explanation of the output: Our code produces a Bayes Factor showing the evidence in favour of model B. The Bayes Factor is 22.71.

Interpreting the Bayes Factor

In session 5 we learned that a BF > 10 is generally considered strong evidence in favour of a hypothesis.

In this example the Bayes Factor is 22.71, so we have strong evidence in favour of model B, compared with model A.

Put differently, model B is about 22 times more likely than model A, given the data we collected.

Interpreting Bayes Factors

We previously said that:

  • BFs > 3 are moderate evidence for a hypothesis, and
  • BFs < 0.33 (\(\tfrac{1}{3}\)) are moderate evidence against a hypothesis.

Psychologists love a “line in the sand”, so a convention has emerged that we believe an effect is ‘real’ if the Bayes Factor is greater than 3, and believe there isn’t a ‘real’ effect if the Bayes Factor is less than 0.33.


In fact though, Bayes Factors are a continuous measure of the evidence for or against a particular model (hypothesis).

Several authors, including Kass & Raftery (1995), have written guidelines on how to interpret Bayes Factors in this way. They provide labels for Bayes Factors of various sizes. This is more useful than a simple “line in the sand”, especially when comparing multiple studies.

A rule-of-thumb interpretation of Bayes Factors

  Bayes Factor   Interpretation
  > 100          Extreme evidence for the hypothesis tested
  30 - 100       Very strong evidence for the hypothesis tested
  10 - 30        Strong evidence for the hypothesis tested
  3 - 10         Moderate evidence for the hypothesis tested
  1 - 3          Anecdotal evidence for the hypothesis tested

Bayes Factors can also be smaller than 1, but these indicate support for the null hypothesis. If this is the case, we often find it helpful to reverse the order of division so that the BF is larger than 1 and state explicitly in the text that the BF we are reporting indicates the support for the null hypothesis.

Imagine we are comparing two regression models, A and B. In the R output we see that the Bayes Factor for model A is 22, and the Bayes Factor for Model B is 56.

  • Which model is more probable?
  • The BF in favour of model B is:
  • The BF in favour of model A is:
  • For the BF in favour of model A, we calculate \(BF_{A} / BF_{B}\) = 22/56 = 0.4

  • For the BF in favour of model B, we calculate \(BF_{B} / BF_{A}\) = 56/22 = 2.5

Note that \(1 / 0.4 = 2.5\). We can use this trick (taking the reciprocal) to ‘reverse’ any Bayes Factor that is < 1 so that it expresses evidence in favour of the other hypothesis.
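The same trick works directly with the models we fitted above: dividing the lmBF objects in the opposite order gives the reciprocal Bayes Factor (a small sketch added here):

# evidence in favour of model A rather than model B (the reciprocal of mB / mA)
mA / mB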

Overall synopsis

In the example above our outcome variable was fuel economy and we had two potential predictor variables: the car’s weight, and the number of cylinders it had.

We ran two models (A and B) which represented different hypotheses:

  • Hypothesis A said only weight was needed to predict fuel economy.
  • Hypothesis B said that both weight and cylinders would be independent predictors of fuel economy.


After fitting both models using lm we calculated \(R^2\) to estimate how well each model fitted the data.

\(R^2_{adjusted}\) was .74 for model A, and .82 for model B, which means both models were quite good. Both explained a large proportion of the variability in fuel economy.


To compare the models, we re-fitted them using lmBF and calculated a Bayes Factor. This told us that – although both were quite a good fit to the data – we had around 22 times more evidence for model B compared with model A.


From this example we conclude that both weight and cylinders are independent predictors of fuel economy. That is, both variables make a useful contribution when predicting fuel economy.

Check your knowledge

  • We prefer models with residuals

  • The \(R^2\) statistic ranges from:

  • It’s better for a model to have a \(R^2\) value

  • If we calculate the Bayes Factor for modelA / modelB and the result is > 1, this provides evidence that

  • A Bayes Factor of 25 means we have evidence

  • A Bayes Factor < 1 means we have





Additional examples and exercises

To consolidate what we have learned, this section asks you to follow the same process using a different dataset.



Student study habits

The studyhabits dataset (in the psydata package) contains data on students and their study habits:

-   work_hours: How many hours per week the student worked
-   work_consistently: A measure of how consistently the student worked each day
-   grade: Their assessment grade (%)

The tasks below use these data to pose a series of questions to check your understanding. They ask you to run additional analyses to build confidence using R.

Task 1: Interpret these plots

Based on these plots of the studyhabits data:

  • Hours worked and assessment grades are

  • Working consistently is likely to produce grades

  • Students who work more consistently tend to work for time overall

Task 2: Describe alternative hypotheses

Imagine we are interested in the effect of working consistently each day on grades, adjusting for the total number of hours studied. That is, we want to know if working consistently helps predict grades, even after we know the total hours spent studying.

A. Fill in the blanks in the following diagram:





B. Which hypothesis do you think is more likely, based on the plots and your own experience?

Task 3: Run the models

What is the correct code to run models for each of the hypotheses we described above?

  • Hypothesis A:

  • Hypothesis B:

Task 4: Interpret the Bayes Factor

Imagine you have already run the following code:

modA <- lmBF(grade ~ work_hours, data=studyhabits)
modB <- lmBF(grade ~ work_hours + work_consistently, data=studyhabits)
  • The command to calculate and display the Bayes Factor in favour of Model B is:
modB / modA
Bayes factor analysis
--------------
[1] work_hours + work_consistently : 91.26606 ±0.01%

Against denominator:
  grade ~ work_hours 
---
Bayes factor type: BFlinearModel, JZS

Task 5: Interpret the Bayes Factor output

Imagine you had run the code below:

modA <- lmBF(grade ~ work_hours, data=studyhabits)
modB <- lmBF(grade ~ work_hours + work_consistently, data=studyhabits)

Interpret the following output:

modB 
Bayes factor analysis
--------------
[1] work_hours + work_consistently : 6.627368e+13 ±0%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS
  • We can interpret the Bayes Factor in this output as saying:
  • The Bayes Factor for model B is:

In the section above we only looked at the output of modB on its own. The default in the BayesFactor package is to give the BF for the comparison against a model with no predictors. This is sometimes what you want, but normally you will have to divide one model by another to get the correct Bayes Factor.

The BF number is written in scientific notation as 6.627368e+13. This means it is roughly \(6.627 \times 10^{13}\), so about 66,273,680,000,000.

Always double check when you see an e in a number in R output.
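If you prefer to see the full number rather than scientific notation, base R's format() function can expand it (a small sketch added here):

# print the number without scientific notation
format(6.627368e+13, scientific = FALSE, big.mark = ",")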

Useful: https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php

Now interpret the following output:

modB / modA
Bayes factor analysis
--------------
[1] work_hours + work_consistently : 91.26606 ±0.01%

Against denominator:
  grade ~ work_hours 
---
Bayes factor type: BFlinearModel, JZS
  • Which of the following options is true:

National happiness

This last exercise is not required; it is offered as an additional challenge for students who have time, or who wish to practice and consolidate skills learned.

The plots below show the distribution of and relationship between three variables in the happy dataset, in the psydata package.

These data are a simplified version of the Gallup World Happiness report (and see more details on the questions asked here).

The variables shown are the national average of responses to three questions:

-   happiness: a measure of subjective well-being, rated on a 1-10 scale
-   social_support: responses to the question: "If you were in trouble, do you have relatives or
    friends you can count on to help you whenever you need them, or not?"
-   freedom_to_make_life_choices: responses to "Are you satisfied or dissatisfied with your
    freedom to choose what you do with your life?"

Using the happy data:

  • Write code to run linear models to predict happiness using the ‘social support’ and ‘freedom’ variables.

  • Draw diagrams to describe two alternative hypotheses, similar to those used above.

  • Calculate a Bayes Factor which compares those two alternative hypotheses. Which hypothesis is more likely?

There is no single set of ‘correct’ answers to these questions. If you would like feedback on your work, speak to a member of staff during the workshops.

References

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.