Making predictions

Ben Whalley, Paul Sharpe, Sonja Heintz, Chris Berry

Overview

‘Linear regression’ is a fancy term for drawing the ‘best-fitting’ line through a scatterplot of two variables, and summarising how well the line describes the data.

When doing regression in R, the relationship between an outcome and one or more predictors is described using a formula. When a model is ‘fitted’ to a sample it becomes a tool to make predictions for future samples. It also allows us to quantify our uncertainty about those predictions.

Coefficients are numbers telling us how strong the relationships between predictors and outcomes are. But the meaning of coefficients depends entirely on the design of our study, and the assumptions we are prepared to make. Causal diagrams (as you drew in the first session) can help us decide how to interpret our results.

Learning outcomes

By the end of this session you will be able to:

  • Use numbers (coefficients) to interpret a line on a scatterplot
  • Run a simple linear regression model in R
  • Make a prediction from it
  • Check 3 key assumptions of the model

The dataset: Screen time and anxiety

Teychenne and Hinkley (2016) investigated the association between anxiety and daily hours of screen time (e.g., TV, computer, or device use) in 528 mothers with young children.

These data are available in the psydata package, in the mentalh dataset.

Open the data and list the variables in the mentalh dataset:

library(psydata)
library(tidyverse) # provides glimpse()

glimpse(mentalh)

# you could also write this for more information on the variables
help(mentalh)

For this session we will mostly focus on:

  • screen_time: How much time the mothers spent on screens, and
  • anxiety_score: the number of anxiety symptoms experienced in the past week

Scatterplots

Showing relationships

In session 1 we learned how to make a scatterplot, using ggplot and geom_point. If we plot the mentalh data, we end up with something like this:

Do it: make a scatter plot

  1. As a refresher, make the plot shown above for yourself using:
  • the mentalh dataset
  • the ggplot and geom_point functions
  • the screen_time and anxiety_score columns
  • remember to use the aes function to select which variable appears on each axis
  2. Describe the relationship between these two variables in no more than 2 or 3 sentences.

Record this in your Rmd workbook for this session.

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() 

Adding best fit lines with geom_smooth()

To make predictions using simple regression we need to draw a line of best fit between one outcome variable and one predictor variable.

  • The outcome is what we want to predict or explain (e.g., anxiety scores)

  • The predictor is what we use to predict the outcome variable (e.g., average hours of screen time per week)

  • Both the outcome and predictor variable must be continuous variables

In this session we stick to drawing straight lines because they represent the simplest-possible relationship.

The best way to ‘see’ a simple linear regression is to use geom_smooth(). This adds a line of best fit to our existing scatter plot.

library(psydata)
library(tidyverse)

# make a scatterplot with a fitted line overlaid
mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm, se=F) # makes straight line with no shading

We can make a scatterplot as we have done previously:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() 

It’s common to place variables that we consider outcomes on the y-axis of the plot and predictors on the x-axis, as here.

If we want to add a line to the plot we can use the geom_smooth function:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth()

By default geom_smooth draws a curved line through the data (it uses a technique called loess smoothing). It’s often quite useful to see the curve because it shows the trend in the relationship.

However in this case we want to see the best linear fit, because that will correspond with the simple linear regression we want to run.

To do that we just add method=lm to geom_smooth:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm)

Now the blue line is the best linear fit between the predictor and the outcome.

  1. Recreate the plot in the video, with screen_time as the predictor and anxiety_score as the outcome, and with the line of best fit (using geom_smooth(method=lm))

  2. Make two versions: one with method="lm", and one without. The lines look different — but which do you think is the “best”? How would we decide? (We will return to this topic later)


  3. If things are going well and only if you have time, practice the technique using another dataset in the psydata package. Be sure to choose two continuous variables to plot.

Equation for a straight line

The line we draw using method="lm" is a simple linear regression. In fact, the name lm stands for linear model.

If we fit a straight line to a scatterplot, we can describe the line using two numbers, in an equation. This has the form:

\(\rm{Predicted\ outcome} = a + (b\times \rm{Predictor})\)

These numbers are called coefficients (this is the statistical term for a number which describes our model).

  • \(a\) is the intercept and represents the height of the line when the predictor is zero

  • \(b\) is the slope or coefficient for the predictor variable (screen_time). It says how steep the line is. Specifically it’s the change in the outcome we expect when the predictor increases by one unit.

You can see the intercept and slope highlighted in the graph below:

Check your understanding. Using the numbers shown in the plot above, answer the following questions:

  1. The expected value of the outcome when the predictor is zero is:
  2. If we increased the value of the predictor by 1, we would expect the outcome to increase by:
  3. If we increased the value of the predictor by 3, we would expect the outcome to increase by:
The answers are:

  1. The expected value is 13.00. This is the intercept.

  2. We would expect the outcome to increase by 5.67. This is equal to the slope coefficient for the predictor.

  3. We expect the outcome to increase by about 17. This is 3 times the slope coefficient, because the slope represents the change in the outcome for a single-unit change in the predictor.
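
You can check the arithmetic for questions 2 and 3 with R as a calculator (a quick sketch using the example numbers shown in the plot above):

# slope * change in the predictor = expected change in the outcome
5.67 * 1   # question 2
5.67 * 3   # question 3: about 17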

Estimating the coefficients

So, how do we calculate the coefficients?

There are actually lots of ways to draw a ‘best-fit’ line, but simple linear regression uses the “method of least squares” (as described in the lecture).

The calculations can be complex, but the lm function in R will do them for us. This example shows how:

# conduct a simple regression to predict anxiety_score from screen_time.
lm(anxiety_score ~ screen_time, data = mentalh)

Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)

Coefficients:
(Intercept)  screen_time  
     5.5923       0.1318  

Explanation of the code

  • The command on line 2 tells R to use the lm function.
  • The first argument passed to lm is known as a model formula
  • The ~ symbol (called a ‘tilde’) means “is predicted by”. So you can read the formula as saying “anxiety score is predicted by screen time”.
  • Finally, we tell lm to use the mentalh data to estimate the coefficients.

Explanation of the output. When we use lm, the output displays:

  • The ‘Call’ we made (i.e. what inputs we gave, so we can remember how we did it later on; these are lines 4 and 5)
  • The coefficients of the model. These are the numbers which represent the line on the graph we saw above.
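
If you are curious how the ‘method of least squares’ arrives at these numbers, the coefficients of a simple regression can also be computed directly from the data. This is just a sketch for interest (it assumes there are no missing values in these two columns); using lm is all you need for this module:

# least-squares formulas for simple regression:
#   slope = covariance(x, y) / variance(x)
#   intercept = mean(y) - slope * mean(x)
b <- cov(mentalh$screen_time, mentalh$anxiety_score) / var(mentalh$screen_time)
a <- mean(mentalh$anxiety_score) - b * mean(mentalh$screen_time)
c(intercept = a, slope = b)  # should match the lm output above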

Check your understanding again

Use the numbers in the output from lm to answer these questions:

  1. The expected anxiety score when screen time is zero is:

  2. If we increased the number of hours spent on screen by 1, we would expect anxiety scores to increase by:

  3. If we increased the number of hours on screen by 10, we would expect the outcome to increase by:

The answers are:

  1. 5.5923

  2. 0.1318

  3. 1.318

If these answers don’t make sense check back to the previous “check your understanding” section, or ask for some help.

Predictions

lm estimates coefficients to describe the line of best fit.

We can use these coefficients with the linear regression equation to make predictions for new individuals using simple arithmetic.

However, it’s more convenient to get R to make the predictions for us. To do this we need to:

  1. Re-run the model, and store the results
  2. Make a new dataframe, with the values of the predictor we will predict-for
  3. Use the augment command to make the predictions

The video covers both the manual and automatic methods of making a prediction from the results of lm.

# run an lm model, saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# the tibble function makes a new data frame
new_data <- tibble(screen_time = 10)

# broom contains the augment function
library(broom)

# make predictions for our new dataframe (i.e. for screentime = 10)
augment(simple1, newdata=new_data)

Making a prediction by hand

To recap, we have seen that:

  • The intercept tells us what outcome to expect when the predictor is zero
  • The slope tells us how much the outcome will change if we increase the predictor by 1

We can combine these two facts to make predictions for individuals, including individuals we didn’t collect data for, using basic arithmetic.

To do this we plug our coefficients back into the equation we saw above:

\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times \rm{Predictor})\)


So, for a person who had 10 hours on a screen, we calculate an anxiety score as follows:

\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times 10) = 6.9103\)



For speed, we can use R as a calculator:

# intercept + slope * 10 hours
5.5923 + 0.1318 * 10
[1] 6.9103

Explanation of output: The result, 6.9103, represents our prediction for someone who spends 10 hours on a screen.



Making a prediction using R

When models get more complex and include more predictors (as they will in the next session) it’s more convenient to get R to make the predictions for us.

To do this we need to:

  1. Re-run the lm, and store the results
  2. Make a new dataframe, with the values of the predictor we will predict-for
  3. Use the augment command to make the predictions

Step 1: re-run the model and save the results

# re-run the same lm model, this time saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

We can check this is the same model by using the name we stored it under as an R command:

simple1

Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)

Coefficients:
(Intercept)  screen_time  
     5.5923       0.1318  

Step 2: Make a new dataframe, with the values of the predictor we will predict-for

# the tibble function makes a new data frame
new_data <- tibble(screen_time = 10)

We can check this has worked as expected in the same way:

new_data

Explanation of the output: We have a new dataframe, called new_data, which contains 1 row and 1 column (screen_time).

Step 3. Use the augment command to make the predictions.

The augment function is in the broom package, so we need to load that first:

library(broom)

Then we can use augment with the stored model (simple1), and the new dataframe (new_data):

augment(simple1, newdata=new_data)

Explanation of the output: The augment function added new columns to the new_data dataframe. The predicted anxiety_score is in the .fitted column and is 6.91. The .se.fit column is the standard error of the predicted values (don’t worry about this for now).

Check your understanding: Coefficients

Using the coefficients in the lm output above, answer the following questions:

  1. What is the expected anxiety score for a person who watched a screen for 2 hours?

  2. What is the expected anxiety score for a person who watched a screen for 24 hours?


The answers are:

  1. 5.856
  2. 8.755
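
As before, you can check these with R as a calculator:

# intercept + slope * hours of screen time
5.5923 + 0.1318 * 2    # 2 hours
5.5923 + 0.1318 * 24   # 24 hours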

Try it: Making a prediction with augment

In your workbook:

  1. Re-run the lm model which predicts anxiety from screen time
  2. Save this model (give it a name like simple1 or model1)
  3. Make a new dataframe so that you can predict anxiety_score for someone who watched screens for 30 hours
  4. Use augment to make the predictions

The expected anxiety score for someone who watched screens for 30 hours is:
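
If you get stuck, the steps look something like this sketch (the names model1 and new_data_30 are just examples; use whatever names you prefer):

# steps 1 and 2: re-run the model and save it
model1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# step 3: a new dataframe containing the value we want a prediction for
new_data_30 <- tibble(screen_time = 30)

# step 4: make the prediction (the .fitted column is the expected anxiety score)
augment(model1, newdata = new_data_30)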

Residuals

We draw the line of best fit as close to the data as we can, but unless the model is perfect there will always be some distance between the individual data points and the fitted line.

These distances are called residuals.

Residuals are important when we:

  • Evaluate how well the model fits
  • Check that our model meets the assumptions of linear regression

# re-run simple linear model and save result as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# use augment to get predictions and residuals for the original dataset
augment(simple1)


The term ‘residual’ is short for residual error in the prediction: that is, the difference between what we predict from the line and the actual data we collected.

We can see the residuals visualised in the plot below:

To calculate a residual by hand, we subtract the predicted value (the point on the line) from the actual data point we observed, so we can say:

\(residual = observed - fitted\).

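For example, for a hypothetical person who watched 10 hours of screens and reported an anxiety score of 10, the calculation would look like this (the observed value here is made up for illustration; the coefficients are from the model above):

# a hypothetical person: observed anxiety_score = 10, screen_time = 10
observed <- 10
fitted   <- 5.5923 + 0.1318 * 10   # the model's prediction (6.9103, as above)
observed - fitted                  # the residual: 3.0897
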
However, a more convenient way to calculate the residuals (as it was for predictions, above), is to use augment.

augment(simple1) %>%  # augment returns a dataframe of predictions and residuals
  head() %>%  # take only first 6 rows
  select(anxiety_score, screen_time, .fitted, .resid) # show only the columns we need

Explanation of the code and output:

  • We used augment without a newdata input. This meant it made a prediction and calculated the residual for every row in the original data file.
  • We used the head function to show only the first few rows, and
  • used the select function to hide extra columns we are not interested in.
  • The output contains the prediction and residual for every row in the original dataset.

Check your understanding

  • The residual is the:
  • Residuals can be calculated with which R function?
  • Residuals will be on the same scale as the:


Try it: Calculate the residuals

  • Make sure you have run an lm and saved the result as simple1, as above
  • Use augment on this saved model without specifying any new data


Check your result

Consider the person in the third row of the mentalh dataset we used to fit the model we saved as simple1:

  • What is their anxiety score?
  • How much screen time did they watch?
  • What is their predicted anxiety score?
  • What is the residual for this prediction?
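
One way to look up just that row is sketched below (slice comes from the tidyverse, which should already be loaded; this assumes no rows were dropped for missing data when the model was fitted):

# show the observed values, prediction and residual for row 3 only
augment(simple1) %>% 
  select(anxiety_score, screen_time, .fitted, .resid) %>% 
  slice(3)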

Checking Assumptions

Regression makes three assumptions about the data that must be true if the results of the model are to be valid and useful:

  1. The ‘true’ best-fit line is straight (i.e. not curved)
  2. Model residuals are approximately normal
  3. The residuals are of equal size along the regression line

How to check

We can make 3 diagnostic plots to check that our assumptions have been met:

  1. We can use a regular scatter plot to check assumption 1. We covered this in previous sessions, and in the content above.

  2. To check assumption 2 we make a plot which shows the distribution of the residuals.

  3. For assumption 3, we can make a scatter plot of the predicted values, against the residuals

# save our model
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)

# use augment to make predictions/residuals
augment(simple1) %>% 
  head()

# make a histogram of residuals from the model (#2)
augment(simple1) %>% 
  ggplot(aes(.resid)) + geom_histogram()

# scatter plot of predicted values vs. residuals  (#3)
augment(simple1) %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth()  

NOTE: I have added a few extra points not mentioned in the video to the end of this transcript section

Checking for linearity

We have already covered this above, and we know that we can make a scatter plot with an added line to check that the relationship between predictor and outcome is broadly linear. For example, as we saw before:
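
This is the same plot we made earlier; the code is repeated here for convenience:

mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method=lm)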

Explanation of the output:

  • We can see that the relation between screen time and anxiety is, broadly, a straight line, so
  • the first assumption of linear regression has been met.

Note, it can be hard to tell if the relationship is linear or curved, especially when data are noisy (like here) or we only have a few samples. However, if the plot isn’t clear then assuming linearity is a good default.

Checking normality of residuals

We want our residuals to be normally distributed. To check the distribution of a variable we can make a histogram of the residuals.

The simplest way is to use the output of augment:

# make a histogram of residuals from the model saved as simple1
augment(simple1) %>% 
  ggplot(aes(.resid)) + geom_histogram()

Interpretation of the plot: These residuals aren’t perfectly normal, but they are close enough for our purposes here.

Homogeneity (consistency) of variance

We also hope that the residuals are roughly the same all the way along the fitted line.
This is sometimes called homogeneity of variance.

To check this we use another scatterplot, with the predicted values (the points on the line) on the \(x\) axis and the residuals on the \(y\) axis. It’s nice to add a smoothed line to this plot, again using geom_smooth, to see any trends more clearly.

In the output of augment, the .fitted column is the prediction (i.e. the points on the line), so we can plot them like this:

# make a scatterplot of predicted values vs residuals 
augment(simple1) %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth() 

Interpretation of the output:

  • The residuals are equally spaced above and below the line and
  • the pattern remains mostly the same across the range of fitted values.
  • As a result, the trend line is mostly flat.
  • It’s not perfect, but this example would be considered ‘good enough’ (i.e. does not violate the assumption).

Extra things I didn’t say in the video

In the example, we have far more data for people who watched < 10 hours of screen time than for people who watched more. This means that in the residuals plots shown, there are more residuals in the left hand side of the plot, and things are more sparse on the right. This doesn’t cause a problem for the homogeneity of variance assumption. For that, all we care about is the vertical spacing — how far the points are above/below the regression line – and that this spacing is equal across the range of screen time.

Try it: Checking assumptions

  • Re-run the lm model, again saving it as simple1

Use augment without specifying new data, and then make:

  • A histogram of the residuals
  • A scatterplot of fitted vs. residual values

simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)
simple1.augment <- augment(simple1)
simple1.augment %>% 
  ggplot(aes(.resid)) + 
  geom_histogram()

simple1.augment %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth(method=lm)

Summary and key points

Simple regression

  • Simple regression fits a line to describe the relationship between an outcome and one predictor
  • The outcome must be a continuous variable
  • The slope and intercept coefficients form part of the regression equation used for prediction
  • These are estimated from the data using the lm function
  • A residual is the distance between the prediction and the real value
  • The augment function calculates predicted values and residuals for us
  • Regression makes 3 important assumptions that we should check using diagnostic plots before interpreting the model output

Key R code

Before continuing, check you are able to use each of these functions confidently:

  • geom_point() and geom_smooth(): scatter plots with fitted lines

  • lm( outcome ~ predictor, dataset): simple regression

  • model_1 <- lm( outcome ~ predictor, dataset): save model results as model_1

  • augment(model_1): Calculate predictions and residuals for a saved model

Check your knowledge

  • What is an intercept?

  • What is the slope?

  • What is a residual?

  • Which function fits a regression model?

  • Which function calculates residuals and predictions from a saved lm model?

  • What are the three assumptions of regression introduced in this session? (Make a note of your answers, then check them here)

Further reading

Please read the following material between the workshop sessions. If you have questions arising from it please bring them to the subsequent workshop and ask a member of staff there.

Why do regression assumptions matter?

We saw that regression models make three important assumptions:

  1. the ‘true’ best-fit line is straight (i.e. not curved)
  2. residuals are approximately normal
  3. residuals are of equal size along the regression line

There are two main reasons we should be suspicious about linear models that violate these assumptions:

  1. Predictions will be systematically wrong in some cases. For example, if the true relationship between predictor and outcome is curved then for some values of the predictor the model will be way-off. We may end up with predictions that are a long way from the true values.

  2. Inferences we make from the model (e.g. using p values or Bayes Factors — we will cover these in the next session) assume that the residuals are normally distributed. If this is not the case, then these statistics can be biased. Specifically, we may conclude that we have more evidence for an effect than is really the case. This will lead us to draw the wrong conclusions from our data and, potentially, make bad decisions.
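
To see the first point in action, here is a small simulated sketch (not part of the original reading): the ‘true’ relationship is curved, so a straight-line fit over- and under-predicts systematically at different values of the predictor.

# simulate a curved relationship and fit a straight line to it
library(tidyverse)
set.seed(1)
sim <- tibble(
  x = seq(1, 10, length.out = 100),
  y = sqrt(x) + rnorm(100, sd = 0.1)   # the true relationship is a curve
)

sim %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)   # the straight line misses the curve at the extremes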

How bad is ‘too bad’?

Some students ask “how will we know if the assumptions have been violated?”. They worry when we say “residuals should be approximately normal”, because they don’t know how much divergence from normality would be ‘too much’.

In truth, there are no hard and fast criteria to check against for any of these assumptions. Sometimes it is obvious that assumptions have been violated, but often it is less clear cut.

For the moment, don’t worry too much because we will make any assumption checks for your assessment very clear-cut. You will know right away (using diagnostic plots) if the assumptions of your model have been met.

However, some specific examples are shown below to provide context.

Examples of assumptions violated

Here we show some unacceptable models, which do violate the regression assumptions.

I have made the same types of diagnostic plot shown above using different datasets and variables. I have added red lines showing the expected trend of the data, if the assumptions had been met.

Example 1: National income and life expectancy

In this example we use the development dataset which contains data about the population and economy of countries around the world. The outcome of our model is the average life expectancy of each country in years. The predictor is the national income of a country, per capita (in $).

Diagnostic plots from a ‘bad’ regression model

This model is problematic because:

  • the relationship between national income and life expectancy is not linear (there is a clear curve)
  • the residuals change in size across the range of predicted values


Interpretation: We should not accept this model, or interpret results from it, because it violates two important assumptions of linear regression.

There are possible fixes for these problems (see the next section), but all that is required for now is to:

  • identify when violations of assumptions have occurred, and remember that
  • interpreting models which violate assumptions is inappropriate
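
For reference, the plots above could be reproduced with code like the sketch below (assuming the development dataset from psydata, with the gdp_per_capita and life_expectancy columns used later in this section). The same pattern applies to the other examples below:

# fit the 'bad' model and make a fitted vs. residual plot
bad1 <- lm(life_expectancy ~ gdp_per_capita, data = development)

augment(bad1) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()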

Example 2

In this example we use the diamonds dataset. The outcome of our model is the price of a diamond (in US dollars). The predictor is the size of the diamond (measured in carats).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

Interpretation: Here all three assumptions are violated. The relationship looks curved, the residuals are definitely not normally-distributed, and the residuals get larger as the prediction increases.

You shouldn’t base inferences on a model like this.

Example 3

In this example we use the mtcars dataset, which contains data on cars. The outcome of our model is the fuel economy of cars in miles per gallon. The predictor is the weight of the car (measured in 1000s of pounds).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

Interpretation: In this model it is hard to tell whether the true relationship is linear, in part because we only have 32 datapoints in the sample. Likewise, it’s hard to tell from the histogram if the residuals are normal. However the fitted vs residual plot does hint that residuals are not evenly-sized across the range of the fitted values, so it might be worth exploring further (see next section).

Example 4

In this example we use the iris dataset, which contains data on the width and length of flower petals for different species. The outcome of our model is the length of petals (in cm). The predictor is the width of the petals (cm).

Given these diagnostic plots, how should we judge the model?

  • The relationship looks linear:
  • The residuals are normally distributed:
  • The residuals are consistent across the range of fitted values:

In this case the relationship looks mostly linear, although there is a tendency for the relationship to tail off (curve) at higher values of petal width. The residuals are normally distributed, but they don’t appear to be consistent across the range of predicted values.

Interpretation: Strictly, we shouldn’t interpret this model as-is. If possible we should adapt our model, perhaps by adding predictors. Nonetheless, plenty of published research would report and base inferences on a model like this.

Extended exercises

This section is optional and not required for the module assessment.

Many students will not have time for this additional material. However it is provided as extended learning for those who may choose to use regression or related techniques in stage 4.

Understanding and fixing violations of assumptions

In an example above we saw that a model predicting life expectancy from GDP per person led us to violate some assumptions of regression. There are many different techniques which can be used to work around this kind of problem. Here we illustrate just one.

If we look at the data used for the model above, we can see that the relationship between GDP and life expectancy is not linear. We can see this because the points are in a curved shape, and so the line cannot get close to them all the way along the range of predictor values. This itself violates an assumption of regression. It also caused problems with the residuals, as we saw above.

To understand why this is the case, we can plot the distribution of national incomes:
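
A sketch of the code that would make this plot (using the same development dataset and gdp_per_capita column as above):

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram()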

Incomes are not normally distributed. In fact, they are very right skewed: most GDP values are quite low, but there is a long tail on the right hand side (a few countries have very large GDPs relative to the median).

This is quite typical of incomes data. At the level of countries, and for individuals, wealth is very unequally distributed. Some people are really a lot richer than the average.

Transforming the predictor

One option in these cases is to transform the skewed variable (here, the predictor) so that it is more normally distributed.

For example, if we plot the logarithm of GDP it becomes (a bit more) normally distributed.

The log function transforms the data and “squashes” the large values downwards, reducing the skew:

# make a log plot of incomes.. now less skewed
development %>% 
  ggplot(aes(log(gdp_per_capita))) + 
  geom_histogram() 

We can use this transformation to our advantage when we run our models. If we plot \(log(gdp)\) rather than \(gdp\) as our predictor, we can see that the relationship between the two variables is linear again:

development %>% 
  ggplot(aes(log(gdp_per_capita), life_expectancy)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F) 

And this means if we use log(gdp_per_capita) as the predictor in an lm model, the model residuals no longer violate the assumptions (or at least, not too badly):

Diagnostic plots from a regression model using a log-transformed predictor.
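
A sketch of the model behind these plots (the name log_model is just an example):

# fit the model with a log-transformed predictor and check fitted vs. residuals
log_model <- lm(life_expectancy ~ log(gdp_per_capita), data = development)

augment(log_model) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()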

Try it for yourself:

  • Use the built in mtcars dataset.

  • Make a scatterplot of the mpg (miles per gallon) and hp (horsepower) variables

  • Add a fitted line using geom_smooth

  • Use a log() transform on the hp variable and make the plot again

  • Do either or both of the plots show a linear relationship? Which would be more suitable to use in a regression model?

  • If all is going well and you have time, you could try running this model using lm and making the diagnostic plots shown above using augment and ggplot.

# looks non-linear
mtcars %>% 
  ggplot(aes(hp, mpg)) + geom_point() + geom_smooth()

# looks more linear
mtcars %>% 
  ggplot(aes(log(hp), mpg)) + geom_point() + geom_smooth()

# run the model and make a fitted vs residual plot
# this isn't perfect, but the residuals look "ok"
lm(mpg ~ log(hp), mtcars) %>% 
  augment() %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F)