Making predictions

Ben Whalley, Paul Sharpe, Sonja Heintz, Chris Berry

Overview

‘Linear regression’ is a fancy term for drawing the ‘best-fitting’ line through a scatterplot of two variables, and summarising how well the line describes the data.

When doing regression in R, the relationship between an outcome and one or more predictors is described using a formula. When a model is ‘fitted’ to a sample it becomes a tool to make predictions for future samples. It also allows us to quantify our uncertainty about those predictions.

Coefficients are numbers telling us how strong the relationships between predictors and outcomes are. But the meaning of coefficients depends entirely on the design of our study, and the assumptions we are prepared to make. Causal diagrams (as you drew in the first session) can help us decide how to interpret our results.

Learning outcomes

By the end of this session you will be able to:

  • Use numbers (coefficients) to interpret a line on a scatterplot
  • Run a simple linear regression model in R
  • Make a prediction from it
  • Check 3 key assumptions of the model

The dataset: Screen time and anxiety

Teychenne and Hinkley (2016) investigated the association between anxiety and daily hours of screen time (e.g., TV, computer, or device use) in 528 mothers with young children.

These data are available in the psydata package, in the mentalh dataset.

Open the data and list the variables in the mentalh dataset:

library(psydata)
library(tidyverse) # provides glimpse()

glimpse(mentalh)

# you could also write this for more information on the variables
help(mentalh)

For this session we will mostly focus on:

  • screen_time: How much time the mothers spent on screens, and
  • anxiety_score: the number of anxiety symptoms experienced in the past week

Scatterplots

Showing relationships

In session 1 we learned how to make a scatterplot, using ggplot and geom_point. If we plot the mentalh data, we end up with something like this:

Do it: make a scatter plot

  1. As a refresher, make the plot shown above for yourself using:
  • the mentalh dataset
  • the ggplot and geom_point functions
  • the screen_time and anxiety_score columns
  • remember to use the aes function to select which variable appears on each axis
  2. Describe the relationship between these two variables in no more than 2 or 3 sentences.

Record this in your Rmd workbook for this session.

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() 

Adding best fit lines with geom_smooth()

To make predictions using simple regression we need to draw a line of best fit between one outcome variable and one predictor variable.

  • The outcome is what we want to predict or explain (e.g., anxiety scores)

  • The predictor is what we use to predict the outcome variable (e.g., average hours of screen time per week)

  • Both the outcome and predictor variable must be continuous variables

In this session we stick to drawing straight lines because they represent the simplest-possible relationship.

The best way to ‘see’ a simple linear regression is to use geom_smooth(). This adds a line of best fit to our existing scatter plot.

library(psydata)
library(tidyverse)

# make a scatterplot with a fitted line overlaid
mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm, se=F) # makes straight line with no shading

We can make a scatterplot as we have done previously:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() 

It’s common to place variables that we consider outcomes on the y-axis of the plot and predictors on the x-axis, as here.

If we want to add a line to the plot we can use the geom_smooth function:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth()

By default geom_smooth draws a curved line through the data (it uses a technique called loess smoothing). It’s often quite useful to see the curve because it shows the trend in the relationship.

However in this case we want to see the best linear fit, because that will correspond with the simple linear regression we want to run.

To do that we just add method=lm to geom_smooth:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm)

Now the blue line is the best linear fit between the predictor and the outcome.

  1. Recreate the plot in the video, with screen_time as the predictor and anxiety_score as the outcome, and with the line of best fit (using geom_smooth(method=lm))

  2. Make two versions: one with method="lm", and one without. The lines look different — but which do you think is the “best”? How would we decide? (We will return to this topic later)


  3. If things are going well and only if you have time, practice the technique using another dataset in the psydata package. Be sure to choose two continuous variables to plot.

Equation for a straight line

The line we draw using method="lm" is a simple linear regression. In fact, the name lm stands for linear model.

If we fit a straight line to a scatterplot, we can describe the line using two numbers, in an equation. This has the form:

\(\rm{Predicted\ outcome} = a + (b\times \rm{Predictor})\)

These numbers are called coefficients (this is the statistical term for a number which describes our model).

  • \(a\) is the intercept and represents the height of the line when the predictor is zero

  • \(b\) is the slope or coefficient for the predictor variable (screen_time). It says how steep the line is. Specifically it’s the change in the outcome we expect when the predictor increases by one unit.

You can see the intercept and slope highlighted in the graph below:

Check your understanding. Using the numbers shown in the plot above, answer the following questions:

  1. The expected value of the outcome when the predictor is zero is:
  2. If we increased the value of the predictor by 1, we would expect the outcome to increase by:
  3. If we increased the value of the predictor by 3, we would expect the outcome to increase by:
The answers are:

  1. The expected value is 13.00. This is the intercept.

  2. We would expect the outcome to increase by 5.67. This is equal to the slope coefficient for the predictor.

  3. We expect the outcome to increase by about 17. This is 3 times the slope coefficient, because the slope represents the change in the outcome for a single-unit change in the predictor.
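
You can check the arithmetic for questions 2 and 3 with R as a calculator (a quick sketch using the example numbers shown in the plot above):

# slope * change in the predictor = expected change in the outcome
5.67 * 1   # question 2
5.67 * 3   # question 3: about 17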

Estimating the coefficients

So, how do we calculate the coefficients?

There are actually lots of ways to draw a ‘best-fit’ line, but simple linear regression uses the “method of least squares” (as described in the lecture).

The calculations can be complex, but the lm function in R will do them for us. This example shows how:

# conduct a simple regression to predict anxiety_score from screen_time.
lm(anxiety_score ~ screen_time, data = mentalh)

Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)

Coefficients:
(Intercept)  screen_time  
     5.5923       0.1318  

Explanation of the code

  • The command on line 2 tells R to use the lm function.
  • The first argument passed to lm is known as a model formula
  • The ~ symbol (called a ‘tilde’) means “is predicted by”. So you can read the formula as saying “anxiety score is predicted by screen time”.
  • Finally, we tell lm to use the mentalh data to estimate the coefficients.

Explanation of the output. When we use lm, the output displays:

  • The ‘Call’ we made (i.e. what inputs we gave, so we can remember how we did it later on; these are lines 4 and 5)
  • The coefficients of the model. These are the numbers which represent the line on the graph we saw above.
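
If you are curious how the ‘method of least squares’ arrives at these numbers, the coefficients of a simple regression can also be computed directly from the data. This is just a sketch for interest (it assumes there are no missing values in these two columns); using lm is all you need for this module:

# least-squares formulas for simple regression:
#   slope = covariance(x, y) / variance(x)
#   intercept = mean(y) - slope * mean(x)
b <- cov(mentalh$screen_time, mentalh$anxiety_score) / var(mentalh$screen_time)
a <- mean(mentalh$anxiety_score) - b * mean(mentalh$screen_time)
c(intercept = a, slope = b)  # should match the lm output above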

Check your understanding again

Use the numbers in the output from lm to answer these questions:

  1. The expected anxiety score when screen time is zero is:

  2. If we increased the number of hours spent on screen by 1, we would expect anxiety scores to increase by:

  3. If we increased the number of hours on screen by 10, we would expect the outcome to increase by:

The answers are:

  1. 5.5923

  2. 0.1318

  3. 1.318

If these answers don’t make sense check back to the previous “check your understanding” section, or ask for some help.

Predictions

lm estimates coefficients to describe the line of best fit.

We can use these coefficients with the linear regression equation to make predictions for new individuals using simple arithmetic.

However, it’s more convenient to get R to make the predictions for us. To do this we need to:

  1. Re-run the model, and store the results
  2. Make a new dataframe, with the values of the predictor we will predict-for
  3. Use the augment command to make the predictions

The video covers both the manual and automatic methods of making a prediction from the results of lm.

# run an lm model, saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# the tibble function makes a new data frame
new_data <- tibble(screen_time = 10)

# broom contains the augment function
library(broom)

# make predictions for our new dataframe (i.e. for screentime = 10)
augment(simple1, newdata=new_data)

Making a prediction by hand

To recap, we have seen that:

  • The intercept tells us what outcome to expect when the predictor is zero
  • The slope tells us how much the outcome will change if we increase the predictor by 1

We can combine these two facts to make predictions for individuals, including individuals we didn’t collect data for, using basic arithmetic.

To do this we plug our coefficients back into the equation we saw above:

\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times \rm{Predictor})\)


So, for a person who had 10 hours on a screen, we calculate an anxiety score as follows:

\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times 10) = 6.9103\)



For speed, we can use R as a calculator:

# intercept + slope * 10 hours
5.5923 + 0.1318 * 10
[1] 6.9103

Explanation of output: The result, 6.9103, represents our prediction for someone who spends 10 hours on a screen.



Making a prediction using R

When models get more complex and include more predictors (as they will in the next session) it’s more convenient to get R to make the predictions for us.

To do this we need to:

  1. Re-run the lm, and store the results
  2. Make a new dataframe, with the values of the predictor we will predict-for
  3. Use the augment command to make the predictions

Step 1: re-run the model and save the results

# re-run the same lm model, this time saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

We can check this is the same model by using the name we stored it under as an R command:

simple1

Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)

Coefficients:
(Intercept)  screen_time  
     5.5923       0.1318  

Step 2: Make a new dataframe, with the values of the predictor we will predict-for

# the tibble function makes a new data frame
new_data <- tibble(screen_time = 10)

We can check this has worked as expected in the same way:

new_data

Explanation of the output: We have a new dataframe, called new_data, which contains 1 row and 1 column (screen_time).

Step 3. Use the augment command to make the predictions.

The augment function is in the broom package, so we need to load that first:

library(broom)

Then we can use augment with the stored model (simple1), and the new dataframe (new_data):

augment(simple1, newdata=new_data)

Explanation of the output: The augment function added new columns to the new_data dataframe. The predicted anxiety_score is in the .fitted column and is 6.91. The .se.fit column is the standard error of the predicted values (don’t worry about this for now).

Check your understanding: Coefficients

Using the coefficients in the lm output above, answer the following questions:

  1. What is the expected anxiety score for a person who watched a screen for 2 hours?

  2. What is the expected anxiety score for a person who watched a screen for 24 hours?


The answers are:

  1. 5.856
  2. 8.755
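
As before, you can check these with R as a calculator:

# intercept + slope * hours of screen time
5.5923 + 0.1318 * 2    # 2 hours
5.5923 + 0.1318 * 24   # 24 hours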

Try it: Making a prediction with augment

In your workbook:

  1. Re-run the lm model which predicts anxiety from screen time
  2. Save this model (give it a name like simple1 or model1)
  3. Make a new dataframe so that you can predict anxiety_score for someone who watched screens for 30 hours
  4. Use augment to make the predictions

The expected anxiety score for someone who watched screens for 30 hours is:
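
If you get stuck, the steps look something like this sketch (the names model1 and new_data_30 are just examples; use whatever names you prefer):

# steps 1 and 2: re-run the model and save it
model1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# step 3: a new dataframe containing the value we want a prediction for
new_data_30 <- tibble(screen_time = 30)

# step 4: make the prediction (the .fitted column is the expected anxiety score)
augment(model1, newdata = new_data_30)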

Residuals

We draw the line of best fit as close to the data as we can, but unless the model is perfect there will always be some distance between the individual data points and the fitted line.

These distances are called residuals.

Residuals are important when we:

  • Evaluate how well the model fits
  • Check that our model meets the assumptions of linear regression

# re-run simple linear model and save result as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# use augment to get predictions and residuals for the original dataset
augment(simple1)


The term ‘residual’ is short for residual error in the prediction: that is, the difference between what we predict from the line and the actual data we collected.

We can see the residuals visualised in the plot below:

To calculate a residual by hand, we subtract the predicted value (the point on the line) from the actual data point we observed, so we can say:

\(residual = observed - fitted\).

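For example, for a hypothetical person who watched 10 hours of screens and reported an anxiety score of 10, the calculation would look like this (the observed value here is made up for illustration; the coefficients are from the model above):

# a hypothetical person: observed anxiety_score = 10, screen_time = 10
observed <- 10
fitted   <- 5.5923 + 0.1318 * 10   # the model's prediction (6.9103, as above)
observed - fitted                  # the residual: 3.0897
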
However, a more convenient way to calculate the residuals (as it was for predictions, above), is to use augment.

augment(simple1) %>%  # augment returns a dataframe of predictions and residuals
  head() %>%  # take only first 6 rows
  select(anxiety_score, screen_time, .fitted, .resid) # show only the columns we need

Explanation of the code and output:

  • We used augment without a newdata input. This meant it made a prediction and calculated the residual for every row in the original data file.
  • We used the head function to show only the first few rows, and
  • used the select function to hide extra columns we are not interested in.
  • The output contains the prediction and residual for every row in the original dataset.

Check your understanding

  • The residual is the:
  • Residuals can be calculated with which R function?
  • Residuals will be on the same scale as the:


Try it: Calculate the residuals

  • Make sure you have run an lm and saved the result as simple1, as above
  • Use augment on this saved model without specifying any new data


Check your result

Consider the person in the third row of the mentalh dataset we used to fit the model we saved as simple1:

  • What is their anxiety score?
  • How much screen time did they watch?
  • What is their predicted anxiety score?
  • What is the residual for this prediction?
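
One way to look up just that row is sketched below (slice comes from the tidyverse, which should already be loaded; this assumes no rows were dropped for missing data when the model was fitted):

# show the observed values, prediction and residual for row 3 only
augment(simple1) %>% 
  select(anxiety_score, screen_time, .fitted, .resid) %>% 
  slice(3)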

Checking Assumptions

Regression makes three assumptions about the data that must be true if the results of the model are to be valid and useful:

  1. The ‘true’ best-fit line is straight (i.e. not curved)
  2. Model residuals are approximately normal
  3. The residuals are of equal size along the regression line

How to check

We can make 3 diagnostic plots to check that our assumptions have been met:

  1. We can use a regular scatter plot to check assumption 1. We covered this in previous sessions, and in the content above.

  2. To check assumption 2 we make a plot which shows the distribution of the residuals.

  3. For assumption 3, we can make a scatter plot of the predicted values, against the residuals

# save our model
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# use augment to make predictions/residuals
augment(simple1) %>% 
  head()

# make a histogram of residuals from the model (#2)
augment(simple1) %>% 
  ggplot(aes(.resid)) + geom_histogram()

# scatter plot of predicted values vs. residuals  (#3)
augment(simple1) %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth()  

NOTE: I have added a few extra points not mentioned in the video to the end of this transcript section

Checking for linearity

We have already covered this above, and we know that we can make a scatter plot with an added line to check that the relationship between predictor and outcome is broadly linear. For example, as we saw before:
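
This is the same plot we made earlier; the code is repeated here for convenience:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm)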

Explanation of the output:

  • We can see that the relation between screen time and anxiety is, broadly, a straight line, so
  • the first assumption of linear regression has been met.

Note, it can be hard to tell if the relationship is linear or curved, especially when data are noisy (like here) or we only have a few samples. However, if the plot isn’t clear then assuming linearity is a good default.

Checking normality of residuals

We want our residuals to be normally distributed. To check the distribution of a variable we can make a histogram of the residuals.

The simplest way is to use the output of augment:

# make a histogram of residuals from the model saved as simple1
augment(simple1) %>% 
  ggplot(aes(.resid)) + geom_histogram()

Interpretation of the plot: These residuals aren’t perfectly normal, but they are close enough for our purposes here.

Homogeneity (consistency) of variance

We also hope that the residuals are roughly the same all the way along the fitted line.
This is sometimes called homogeneity of variance.

To check this we use another scatterplot, with the predicted values (the points on the line) on the \(x\) axis and the residuals on the \(y\) axis. It’s nice to add a smoothed line to this plot, again using geom_smooth, to see any trends more clearly.

In the output of augment, the .fitted column is the prediction (i.e. the points on the line), so we can plot them like this:

# make a scatterplot of predicted values vs residuals 
augment(simple1) %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth() 

Interpretation of the output:

  • The residuals are equally spaced above and below the line and
  • the pattern remains mostly the same across the range of fitted values.
  • As a result, the trend line is mostly flat.
  • It’s not perfect, but this example would be considered ‘good enough’ (i.e. does not violate the assumption).

Extra things I didn’t say in the video

In the example, we have far more data for people who watched < 10 hours of screen time than for people who watched more. This means that in the residuals plots shown, there are more residuals in the left hand side of the plot, and things are more sparse on the right. This doesn’t cause a problem for the homogeneity of variance assumption. For that, all we care about is the vertical spacing — how far the points are above/below the regression line – and that this spacing is equal across the range of screen time.

Try it: Checking assumptions

  • Re-run the lm model, again saving it as simple1

Use augment without specifying new data, and then make:

  • A histogram of the residuals
  • A scatterplot of fitted vs. residual values

simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)
simple1.augment <- augment(simple1)
simple1.augment %>% 
  ggplot(aes(.resid)) + 
  geom_histogram()

simple1.augment %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth(method=lm)

Summary and key points

Simple regression

  • Simple regression fits a line to describe the relationship between an outcome and one predictor
  • The outcome must be a continuous variable
  • The slope and intercept coefficients form part of the regression equation used for prediction
  • These are estimated from the data using the lm function
  • A residual is the distance between the prediction and the real value
  • The augment function calculates predicted values and residuals for us
  • Regression makes 3 important assumptions that we should check using diagnostic plots before interpreting the model output

Key R code

Before continuing, check you are able to use each of these functions confidently:

  • geom_point() and geom_smooth(): scatter plots with fitted lines

  • lm( outcome ~ predictor, dataset): simple regression

  • model_1 <- lm( outcome ~ predictor, dataset): save model results as model_1

  • augment(model_1): Calculate predictions and residuals for a saved model

Check your knowledge

  • What is an intercept?

  • What is the slope?

  • What is a residual?

  • Which function fits a regression model?

  • Which function calculates residuals and predictions from a saved lm model?

  • What are the three assumptions of regression introduced in this session? (Make a note of your answers, then check them here)

Further reading

Please read the following material between the workshop sessions. If you have questions arising from it please bring them to the subsequent workshop and ask a member of staff there.

Why do regression assumptions matter?

We saw that regression models make three important assumptions:

  1. the ‘true’ best-fit line is straight (i.e. not curved)
  2. residuals are approximately normal
  3. residuals are of equal size along the regression line

There are two main reasons we should be suspicious about linear models that violate these assumptions:

  1. Predictions will be systematically wrong in some cases. For example, if the true relationship between predictor and outcome is curved then for some values of the predictor the model will be way-off. We may end up with predictions that are a long way from the true values.

  2. Inferences we make from the model (e.g. using p values or Bayes Factors — we will cover these in the next session) assume that the residuals are normally distributed. If this is not the case, then these statistics can be biased. Specifically, we may conclude that we have more evidence for an effect than is really the case. This will lead us to draw the wrong conclusions from our data and, potentially, make bad decisions.
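
To see the first point in action, here is a small simulated sketch (not part of the original reading): the ‘true’ relationship is curved, so a straight-line fit over- and under-predicts systematically at different values of the predictor.

# simulate a curved relationship and fit a straight line to it
library(tidyverse)
set.seed(1)
sim <- tibble(
  x = seq(1, 10, length.out = 100),
  y = sqrt(x) + rnorm(100, sd = 0.1)   # the true relationship is a curve
)

sim %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)   # the straight line misses the curve at the extremes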

How bad is ‘too bad’?

Some students ask “how will we know if the assumptions have been violated?”. They worry when we say “residuals should be approximately normal”, because they don’t know how much divergence from normality would be ‘too much’.

In truth, there are no hard and fast criteria to check against for any of these assumptions. Sometimes it is obvious that assumptions have been violated, but often it is less clear cut.

For the moment, don’t worry too much because we will make any assumption checks for your assessment very clear-cut. You will know right away (using diagnostic plots) if the assumptions of your model have been met.

However, some specific examples are shown below to provide context.

Examples of assumptions violated

Here we show some unacceptable models, which do violate the regression assumptions.

I have made the same types of diagnostic plot shown above using different datasets and variables. I have added red lines showing the expected trend of the data, if the assumptions had been met.

Example 1: National income and life expectancy

In this example we use the development dataset which contains data about the population and economy of countries around the world. The outcome of our model is the average life expectancy of each country in years. The predictor is the national income of a country, per capita (in $).

Diagnostic plots from a ‘bad’ regression model

This model is problematic because:

  • the relationship between national income and life expectancy is not linear (there is a clear curve)
  • the residuals change in size across the range of predicted values


Interpretation: We should not accept this model, or interpret results from it, because it violates two important assumptions of linear regression.

There are possible fixes for these problems (see the next section), but all that is required for now is to:

  • identify when violations of assumptions have occurred, and remember that
  • interpreting models which violate assumptions is inappropriate
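
For reference, the plots above could be reproduced with code like the sketch below (assuming the development dataset from psydata, with the gdp_per_capita and life_expectancy columns used later in this section). The same pattern applies to the other examples below:

# fit the 'bad' model and make a fitted vs. residual plot
bad1 <- lm(life_expectancy ~ gdp_per_capita, data = development)

augment(bad1) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()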

Example 2

In this example we use the diamonds dataset. The outcome of our model is the price of a diamond (in US dollars). The predictor is the size of the diamond (measured in carats).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

Interpretation: Here all three assumptions are violated. The relationship looks curved, the residuals are definitely not normally-distributed, and the residuals get larger as the prediction increases.

You shouldn’t base inferences on a model like this.

Example 3

In this example we use the mtcars dataset, which contains data on cars. The outcome of our model is the fuel economy of cars in miles per gallon. The predictor is the weight of the car (measured in 1000s of pounds).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

Interpretation: In this model it is hard to tell whether the true relationship is linear, in part because we only have 32 datapoints in the sample. Likewise, it’s hard to tell from the histogram if the residuals are normal. However the fitted vs residual plot does hint that residuals are not evenly-sized across the range of the fitted values, so it might be worth exploring further (see next section).

Example 4

In this example we use the iris dataset, which contains data on the width and length of flower petals for different species. The outcome of our model is the length of petals (in cm). The predictor is the width of the petals (cm).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

In this case the relationship looks mostly linear, although there is a tendency for the relationship to tail off (curve) at higher values of petal width. The residuals are normally distributed, but they don’t appear to be consistent across the range of predicted values.

Interpretation: Strictly, we shouldn’t interpret this model as-is. If possible we should adapt our model, perhaps by adding predictors. Nonetheless, plenty of published research would report and base inferences on a model like this.

Extended exercises

This section is optional and not required for the module assessment.

Many students will not have time for this additional material. However it is provided as extended learning for those who may choose to use regression or related techniques in stage 4.

Understanding and fixing violations of assumptions

In an example above we saw that a model predicting life expectancy from GDP per person led us to violate some assumptions of regression. There are many different techniques which can be used to work around this kind of problem. Here we illustrate just one.

If we look at the data used for the model above, we can see that the relationship between GDP and life expectancy is not linear. We can see this because the points are in a curved shape, and so the line cannot get close to them all the way along the range of predictor values. This itself violates an assumption of regression. It also caused problems with the residuals, as we saw above.

To understand why this is the case, we can plot the distribution of national incomes:
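
A sketch of the code that would make this plot (using the same development dataset and gdp_per_capita column as above):

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram()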

Incomes are not normally distributed. In fact, they are very right skewed: most GDP values are quite low, but there is a long tail on the right hand side (a few countries have very large GDPs relative to the median).

This is quite typical of incomes data. At the level of countries, and for individuals, wealth is very unequally distributed. Some people are really a lot richer than the average.

Transforming the predictor

One option in these cases is to transform the skewed variable (here, the predictor) so that it is more normally distributed.

For example, if we plot the logarithm of GDP it becomes (a bit more) normally distributed.

The log function transforms the data and “squashes” the large values downwards, reducing the skew:

# make a log plot of incomes.. now less skewed
development %>% 
  ggplot(aes(log(gdp_per_capita))) + 
  geom_histogram() 

We can use this transformation to our advantage when we run our models. If we plot \(log(gdp)\) rather than \(gdp\) as our predictor, we can see that the relationship between the two variables is linear again:

development %>% 
  ggplot(aes(log(gdp_per_capita), life_expectancy)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F) 

And this means if we use log(gdp_per_capita) as the predictor in an lm model, the model residuals no longer violate the assumptions (or at least, not too badly):

Diagnostic plots from a regression model using a log-transformed predictor.
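
A sketch of the model behind these plots (the name log_model is just an example):

# fit the model with a log-transformed predictor and check fitted vs. residuals
log_model <- lm(life_expectancy ~ log(gdp_per_capita), data = development)

augment(log_model) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()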

Try it for yourself:

  • Use the built in mtcars dataset.

  • Make a scatterplot of the mpg (miles per gallon) and hp (horsepower) variables

  • Add a fitted line using geom_smooth

  • Use a log() transform on the hp variable and make the plot again

  • Do either or both of the plots show a linear relationship? Which would be more suitable to use in a regression model?

  • If all is going well and you have time, you could try running this model using lm and making the diagnostic plots shown above using augment and ggplot.

# looks non-linear
mtcars %>% 
  ggplot(aes(hp, mpg)) + geom_point() + geom_smooth()

# looks more linear
mtcars %>% 
  ggplot(aes(log(hp), mpg)) + geom_point() + geom_smooth()

# run the model and make a fitted vs residual plot
# this isn't perfect, but the residuals look "ok"
lm(mpg ~ log(hp), mtcars) %>% 
  augment() %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F)