Overview
‘Linear regression’ is a fancy term for drawing the ‘best-fitting’ line through a scatterplot of two variables, and summarising how well the line describes the data.
When doing regression in R, the relationship between an outcome and one or more predictors is described using a formula. When a model is ‘fitted’ to a sample it becomes a tool to make predictions for future samples. It also allows us to quantify our uncertainty about those predictions.
Coefficients are numbers telling us how strong the relationships between predictors and outcomes are. But the meaning of coefficients depends entirely on the design of our study, and the assumptions we are prepared to make. Causal diagrams (as you drew in the first session) can help us decide how to interpret our results.
Learning outcomes
By the end of this session you will be able to:
- Use numbers (coefficients) to interpret a line on a scatterplot
- Run a simple linear regression model in R
- Make a prediction from it
- Check 3 key assumptions of the model
Topics covered
The dataset: Screen time and anxiety
Teychenne and Hinkley (2016) investigated the association between anxiety and daily hours of screen time (e.g., TV, computer, or device use) in 528 mothers with young children.
These data are available in the psydata package, in the mentalh dataset.
Open the data and list the variables in the mentalh dataset:
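# load the psydata package (and the tidyverse, which provides glimpse)
library(psydata)
library(tidyverse)

# glimpse is one way to list the variables (names(mentalh) would also work)
mentalh %>% glimpse()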
For this session we will mostly focus on:
- screen_time: how much time the mothers spent on screens, and
- anxiety_score: the number of anxiety symptoms experienced in the past week
Scatterplots
Showing relationships
In session 1 we learned how to make a scatterplot, using ggplot and geom_point. If we plot the mentalh data, we end up with something like this:
Do it: make a scatter plot
- As a refresher, make the plot shown above for yourself using:
  - the mentalh dataset
  - the ggplot and geom_point functions
  - the screen_time and anxiety_score columns
  - remember to use the aes function to select which variable appears on each axis
- Describe the relationship between these two variables in no more than 2 or 3 sentences.
Record this in your Rmd workbook for this session.
Adding best fit lines with geom_smooth()
To make predictions using simple regression we need to draw a line of best fit between one outcome variable and one predictor variable:
- The outcome is what we want to predict or explain (e.g., anxiety scores)
- The predictor is what we use to predict the outcome variable (e.g., average hours of screen time per week)
- Both the outcome and the predictor must be continuous variables
In this session we stick to drawing straight lines because they represent the simplest-possible relationship.
The best way to ‘see’ a simple linear regression is to use geom_smooth(). This adds a line of best fit to our existing scatter plot.
We can make a scatterplot as we have done previously:
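# scatterplot of anxiety against screen time
mentalh %>%
  ggplot(aes(x = screen_time, y = anxiety_score)) +
  geom_point()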
It’s common to place variables that we consider outcomes on the y-axis of the plot and predictors on the x-axis, as here.
If we want to add a line to the plot we can use the geom_smooth function:
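# add a line of best fit with geom_smooth (a loess curve by default)
mentalh %>%
  ggplot(aes(x = screen_time, y = anxiety_score)) +
  geom_point() +
  geom_smooth()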
By default geom_smooth draws a curved line through the data (it uses a technique called loess smoothing). It’s often quite useful to see the curve because it shows the trend in the relationship.
However in this case we want to see the best linear fit, because that will correspond with the simple linear regression we want to run.
To do that we just add method=lm to geom_smooth:
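# ask geom_smooth for a straight (linear model) fit line
mentalh %>%
  ggplot(aes(x = screen_time, y = anxiety_score)) +
  geom_point() +
  geom_smooth(method = lm)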
Now the blue line is the best linear fit between the predictor and the outcome.
- Recreate the plot in the video, with screen_time as the predictor, anxiety_score as the outcome, and with a line of best fit (using geom_smooth(method=lm))
- Make two versions: one with method="lm", and one without. The lines look different — but which do you think is the “best”? How would we decide? (We will return to this topic later)
- If things are going well and only if you have time, practice the technique using another dataset in the psydata package. Be sure to choose two continuous variables to plot.
Equation for a straight line
The line we draw using method="lm" is a simple linear regression. In fact, the name lm stands for linear model.
If we fit a straight line to a scatterplot, we can describe the line using two numbers, in an equation. This has the form:
\(\rm{Predicted\ outcome} = a + (b\times \rm{Predictor})\)
These numbers are called coefficients (this is the statistical term for a number which describes our model).
\(a\) is the intercept and represents the height of the line when the predictor is zero
\(b\) is the slope or coefficient for the predictor variable (screen_time). It says how steep the line is. Specifically, it’s the change in the outcome we expect when the predictor increases by one unit.
You can see the intercept and slope highlighted in the graph below:
Check your understanding. Using the numbers shown in the plot above, answer the following questions:
- The expected value of the outcome when the predictor is zero is:
- If we increased the value of the predictor by 1, we would expect the outcome to increase by:
- If we increased the value of the predictor by 3, we would expect the outcome to increase by:
The expected value is 13.00. This is the intercept.
We would expect the outcome to increase by 5.67. This is equal to the slope coefficient for the predictor.
We expect the outcome to increase by about 17. This is 3 times the slope coefficient, because the slope represents the change in the outcome for a single-unit change in the predictor.
Estimating the coefficients
So, how do we calculate the coefficients?
There are actually lots of ways to draw a ‘best-fit’ line, but simple linear regression uses the “method of least squares” (as described in the lecture).
The calculations can be complex, but the lm function in R will do them for us. This example shows how:
# conduct a simple regression to predict anxiety_score from screen_time.
lm(anxiety_score ~ screen_time, data = mentalh)
Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)
Coefficients:
(Intercept) screen_time
5.5923 0.1318
Explanation of the code
- The command on line 2 tells R to use the lm function.
- The first argument passed to lm is known as a model formula.
- The ~ symbol (called a ‘tilde’) means “is predicted by”. So you can read the formula as saying “anxiety score is predicted by screen time”.
- Finally, we tell lm to use the mentalh data to estimate the coefficients.
Explanation of the output. When we use lm, the output displays:
- The ‘Call’ we made (i.e. what inputs we gave, so we can remember how we did it later on; this is shown on lines 4 and 5)
- The coefficients of the model. These are the numbers which represent the line on the graph we saw above.
Check your understanding again
Use the numbers in the output from lm to answer these questions:
The expected anxiety score when screen time is zero is:
If we increased the number of hours spent on screen by 1, we would expect anxiety scores to increase by:
If we increased the number of hours on screen by 10, we would expect the outcome to increase by:
The answers are:
5.5923
0.1318
1.318
If these answers don’t make sense check back to the previous “check your understanding” section, or ask for some help.
Predictions
lm estimates coefficients to describe the line of best fit.
We can use these coefficients with the linear regression equation to make predictions for new individuals using simple arithmetic.
However, it’s more convenient to get R to make the predictions for us. To do this we need to:
- Re-run the model, and store the results
- Make a new dataframe, with the values of the predictor we will predict-for
- Use the augment command to make the predictions
The video covers both the manual and automatic methods of making a prediction from the results of lm.
# run an lm model, saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)
# the tibble function makes a new data frame
new_data <- tibble(screen_time = 10)
# broom contains the augment function
library(broom)
# make predictions for our new dataframe (i.e. for screentime = 10)
augment(simple1, newdata=new_data)
Making a prediction by hand
To recap, we have seen that:
- The intercept tells us what outcome to expect when the predictor is zero
- The slope tells us how much the outcome will change if we increase the predictor by 1
We can combine these two facts to make predictions for individuals (including individuals we didn’t collect data for) using basic arithmetic.
To do this we plug our coefficients back into the equation we saw above:
\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times \rm{Screen\ time})\)
So, for a person who spent 10 hours on a screen, we calculate an anxiety score as follows:
\(\rm{Anxiety\ score} = 5.5923 + (0.1318 \times 10) = 6.9103\)
For speed, we can use R as a calculator:
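# use R as a calculator: intercept + (slope * hours of screen time)
5.5923 + (0.1318 * 10)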
[1] 6.9103
Explanation of output: The result, 6.9103, represents our prediction for someone who spends 10 hours on a screen.
Making a prediction using R
When models get more complex and include more predictors (as they will in the next session) it’s more convenient to get R to make the predictions for us.
To do this we need to:
- Re-run the lm, and store the results
- Make a new dataframe, with the values of the predictor we will predict-for
- Use the augment command to make the predictions
Step 1: re-run the model and save the results
# re-run the same lm model, this time saving the results as `simple1`
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)
We can check this is the same model by using the name we stored it under as an R command:
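# print the stored model by typing its name
simple1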
Call:
lm(formula = anxiety_score ~ screen_time, data = mentalh)
Coefficients:
(Intercept) screen_time
5.5923 0.1318
Step 2: Make a new dataframe, with the values of the predictor we will predict-for
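# make a new dataframe containing the value of the predictor we want a prediction for
new_data <- tibble(screen_time = 10)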
We can check this has worked as expected in the same way:
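# check the contents of the new dataframe by typing its name
new_data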
Explanation of the output: We have a new dataframe, called new_data, which contains 1 row and 1 column (screen_time).
Step 3. Use the augment command to make the predictions.
The augment function is in the broom package, so we need to load that first:
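# load the broom package, which contains the augment function
library(broom)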
Then we can use augment with the stored model (simple1), and the new dataframe (new_data):
Explanation of the output: The augment function added new columns to the new_data dataframe. The predicted anxiety_score is in the .fitted column and is 6.91. The .se.fit column is the standard error of the predicted values (don’t worry about this for now).
Check your understanding: Coefficients
Using the coefficients in the lm output above, answer the following questions:
What is the expected anxiety score for a person who watched a screen for 2 hours?
What is the expected anxiety score for a person who watched a screen for 24 hours?
* 5.856
* 8.755
Try it: Making a prediction with augment
In your workbook:
- Re-run the lm model which predicts anxiety from screen time
- Save this model (give it a name like simple1 or model1)
- Make a new dataframe so that you can predict anxiety_score for someone who watched screens for 30 hours
- Use augment to make the predictions
The expected anxiety score for someone who watched screens for 30 hours is:
Residuals
We draw the line of best fit as close to the data as we can, but unless the model is perfect there will always be some distance between the individual data points and the fitted line.
These distances are called residuals.
Residuals are important when we:
- Evaluate how well the model fits
- Check that our model meets the assumptions of linear regression
The term ‘residual’ is short for residual error in the prediction: that is, the difference between what we predict from the line and the actual data we collected.
We can see the residuals visualised in the plot below:
To calculate the residuals by hand we would subtract the predicted value (the point on the line) from the actual data point we observed: so we can say:
\(residual = observed - fitted\).
However, a more convenient way to calculate the residuals (as it was for predictions, above) is to use augment.
augment(simple1) %>% # augment returns a dataframe of predictions and residuals
head() %>% # take only first 6 rows
select(anxiety_score, screen_time, .fitted, .resid) # show only the columns we need
Explanation of the code and output:
- We used augment without a newdata input. This meant it made a prediction and calculated the residual for every row in the original data file.
- We used the head function to show only the first few rows, and
- used the select function to hide extra columns we are not interested in.
- The output (before head is applied) contains the prediction and residual for every row in the original dataset.
Check your understanding
- The residual is the
- Residuals can be calculated with which R function?
- Residuals will be on the same scale as the:
Try it: Calculate the residuals
- Make sure you have run an lm and saved the result as simple1, as above
- Use augment on this saved model without specifying any new data
Check your result
Consider the person in the third row of the mentalh dataset we used to fit the model we saved as simple1:
- What is their anxiety score?
- How much screen time did they watch?
- What is their predicted anxiety score?
- What is the residual for this prediction?
Checking Assumptions
Regression makes three assumptions about the data that must be true if the results of the model are to be valid and useful:
- The ‘true’ best-fit line is straight (i.e. not curved)
- Model residuals are approximately normal
- The residuals are of equal size along the regression line
How to check
We can make 3 diagnostic plots to check that our assumptions have been met:
We can use a regular scatter plot to check assumption 1. We covered this in previous sessions, and in the content above.
To check assumption 2 we make a plot which shows the distribution of the residuals.
For assumption 3, we can make a scatter plot of the predicted values against the residuals.
# save our model
simple1 <- lm(anxiety_score ~ screen_time, data = mentalh)
# use augment to make predictions/residuals
augment(simple1) %>%
head()
# make a histogram of residuals from the model (#2)
augment(simple1) %>%
ggplot(aes(.resid)) + geom_histogram()
# scatter plot of predicted values vs. residuals (#3)
augment(simple1) %>%
ggplot(aes(.fitted, .resid)) +
geom_point() +
geom_smooth()
NOTE: I have added a few extra points not mentioned in the video to the end of this transcript section
Checking for linearity
We have already covered this above, and we know that we can make a scatter plot with an added line to check that the relationship between predictor and outcome is broadly linear. For example, as we saw before:
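# scatterplot with a linear fit line, to check the relationship is broadly straight
mentalh %>%
  ggplot(aes(screen_time, anxiety_score)) +
  geom_point() +
  geom_smooth(method = lm)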
Explanation of the output:
- We can see that the relation between screen time and anxiety is, broadly, a straight line, so
- the first assumption of linear regression has been met.
Note, it can be hard to tell if the relationship is linear or curved, especially when data are noisy (like here) or we only have a few samples. However, if the plot isn’t clear then assuming linearity is a good default.
Checking normality of residuals
We want our residuals to be normally distributed. To check the distribution of a variable we can make a histogram of the residuals.
The simplest way is to use the output of augment
:
# make a histogram of residuals from the model saved as simple1
augment(simple1) %>%
ggplot(aes(.resid)) + geom_histogram()
Interpretation of the plot: These residuals aren’t perfectly normal, but they are close enough for our purposes here.
Homogeneity (consistency) of variance
We also hope that the residuals are roughly the same size all the way along the fitted line. This is sometimes called homogeneity of variance.
To check this we use another scatterplot, with the predicted values (the points on the line) on the \(x\) axis and the residuals on the \(y\) axis. It’s nice to add a smoothed line to this plot, again using geom_smooth, to see any trends more clearly.
In the output of augment, the .fitted column is the prediction (i.e. the points on the line), so we can plot them like this:
# make a scatterplot of predicted values vs residuals
augment(simple1) %>%
ggplot(aes(.fitted, .resid)) +
geom_point() +
geom_smooth()
Interpretation of the output:
- The residuals are equally spaced above and below the line and
- the pattern remains mostly the same across the range of fitted values.
- As a result, the trend line is mostly flat.
- It’s not perfect, but this example would be considered ‘good enough’ (i.e. does not violate the assumption).
Extra things I didn’t say in the video
In the example, we have far more data for people who watched < 10 hours of screen time than for people who watched more. This means that in the residuals plots shown, there are more residuals in the left hand side of the plot, and things are more sparse on the right. This doesn’t cause a problem for the homogeneity of variance assumption. For that, all we care about is the vertical spacing — how far the points are above/below the regression line – and that this spacing is equal across the range of screen time.
Try it: Checking assumptions
- Re-run the lm model, again saving it as simple1
- Use augment without specifying new data, and then make:
- A histogram of the residuals
- A scatterplot of fitted vs. residual values
Summary and key points
Simple regression
- Simple regression fits a line to describe the relationship between an outcome and one predictor
- The outcome must be a continuous variable
- The slope and intercept coefficients form part of the regression equation used for prediction
- These are estimated from the data using the lm function
- A residual is the distance between the prediction and the real value
- The augment function calculates predicted values and residuals for us
- Regression makes 3 important assumptions that we should check using diagnostic plots before interpreting the model output
Key R code
Before continuing, check you are able to use each of these functions confidently:
- geom_point() and geom_smooth(): scatter plots with fitted lines
- lm(outcome ~ predictor, dataset): simple regression
- model_1 <- lm(outcome ~ predictor, dataset): save model results as model_1
- augment(model_1): calculate residuals for a saved model
Check your knowledge
What is an intercept?
What is the slope?
What is a residual?
Which function fits a regression model?
Which function calculates residuals and predictions from a saved lm model?
What are the three assumptions of regression introduced in this session? (Make a note of your answers, then check them here)
Further reading
Please read the following material between the workshop sessions. If you have questions arising from it please bring them to the subsequent workshop and ask a member of staff there.
Why do regression assumptions matter?
We saw that regression models make three important assumptions:
- the ‘true’ best-fit line is straight (i.e. not curved)
- residuals are approximately normal
- residuals are of equal size along the regression line
There are two main reasons we should be suspicious about linear models that violate these assumptions:
Predictions will be systematically wrong in some cases. For example, if the true relationship between predictor and outcome is curved then for some values of the predictor the model will be way-off. We may end up with predictions that are a long way from the true values.
Inferences we make from the model (e.g. using p values or Bayes Factors — we will cover these in the next session) assume that the residuals are normally distributed. If this is not the case, then these statistics can be biased. Specifically, we may conclude that we have more evidence for an effect than is really the case. This will lead us to draw the wrong conclusions from our data and, potentially, make bad decisions.
How bad is ‘too bad’?
Some students ask “how will we know if the assumptions have been violated?”. They worry when we say “residuals should be approximately normal”, because they don’t know how much divergence from normality would be ‘too much’.
In truth, there are no hard and fast criteria to check against for any of these assumptions. Sometimes it is obvious that assumptions have been violated, but often it is less clear cut.
For the moment, don’t worry too much because we will make any assumption checks for your assessment very clear-cut. You will know right away (using diagnostic plots) if the assumptions of your model have been met.
However, some specific examples are shown below to provide context.
Examples of assumptions violated
Here we show some unacceptable models, which do violate the regression assumptions.
I have made the same types of diagnostic plot shown above, using different datasets and variables. I have added red lines showing the expected trend of the data if the assumptions had been met.
Example 1: National income and life expectancy
In this example we use the development dataset, which contains data about the population and economy of countries around the world. The outcome of our model is the average life expectancy of each country in years. The predictor is the national income of a country, per capita (in $).
This model is problematic because:
- the relationship between national income and life expectancy is not linear (there is a clear curve)
- the residuals change in size across the range of predicted values
Interpretation: We should not accept this model, or interpret results from it, because it violates two important assumptions of linear regression.
There are possible fixes for these problems (see the next section), but all that is required for now is to:
- identify when violations of assumptions have occurred, and remember that
- interpreting models which violate assumptions is inappropriate
Example 2
In this example we use the diamonds dataset. The outcome of our model is the price of a diamond (in US dollars). The predictor is the size of the diamond (measured in carats).
Given these diagnostic plots, how should we judge the model?
- The relationship looks linear:
- The residuals are normally distributed:
- The residuals are consistent across the range of fitted values:
Interpretation: Here all three assumptions are violated. The relationship looks curved, the residuals are definitely not normally-distributed, and the residuals get larger as the prediction increases.
You shouldn’t base inferences on a model like this.
Example 3
In this example we use the mtcars dataset, which contains data on cars. The outcome of our model is the fuel economy of cars in miles per gallon. The predictor is the weight of the car (recorded in 1000s of lbs).
Given these diagnostic plots, how should we judge the model?
- The relationship looks linear:
- The residuals are normally distributed:
- The residuals are consistent across the range of fitted values:
Interpretation: In this model it is hard to tell whether the true relationship is linear, in part because we only have 32 datapoints in the sample. Likewise, it’s hard to tell from the histogram if the residuals are normal. However, the fitted vs. residual plot does hint that residuals are not evenly-sized across the range of the fitted values, so it might be worth exploring further (see next section).
Example 4
In this example we use the iris dataset, which contains data on the width and length of flower petals for different species. The outcome of our model is the length of petals (in cm). The predictor is the width of the petals (in cm).
Given these diagnostic plots, how should we judge the model?
- The relationship looks linear:
- The residuals are normally distributed:
- The residuals are consistent across the range of fitted values:
In this case the relationship looks mostly linear, although there is a tendency for the relationship to tail off (curve) at higher values of petal width. The residuals are normally distributed, but they don’t appear to be consistent across the range of predicted values.
Interpretation: Strictly, we shouldn’t interpret this model as-is. If possible we should adapt our model, perhaps by adding predictors. Nonetheless, plenty of published research would report and base inferences on a model like this.
Extended exercises
This section is optional and not required for the module assessment.
Many students will not have time for this additional material. However it is provided as extended learning for those who may choose to use regression or related techniques in stage 4.
Understanding and fixing violations of assumptions
In an example above we saw that a model predicting life expectancy from GDP per person led us to violate some assumptions of regression. There are many different techniques which can be used to work around this kind of problem. Here we illustrate just one.
If we look at the data used for the model above, we can see that the relationship between GDP and life expectancy is not linear. We can see this because the points are in a curved shape, and so the line cannot get close to them all the way along the range of predictor values. This itself violates an assumption of regression. It also caused problems with the residuals, as we saw above.
To understand why this is the case, we can plot the distribution of national incomes:
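# histogram of national income (GDP per capita)
development %>%
  ggplot(aes(gdp_per_capita)) +
  geom_histogram()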
Incomes are not normally distributed. In fact, they are very right skewed: most GDP values are quite low, but there is a long tail on the right hand side (a few countries have very large GDPs relative to the median).
This is quite typical of incomes data. At the level of countries, and for individuals, wealth is very unequally distributed. Some people are really a lot richer than the average.
Transforming the outcome
One option in these cases is to transform the outcome so that it is more normally distributed.
For example, if we plot the logarithm of GDP it becomes (a bit more) normally distributed.
The log function transforms the data and “squashes” the large values downwards, reducing the skew:
# make a log plot of incomes.. now less skewed
development %>%
ggplot(aes(log(gdp_per_capita))) +
geom_histogram()
We can use this transformation to our advantage when we run our models. If we use \(log(gdp)\) rather than \(gdp\) as our predictor, we can see that the relationship between the two variables is now close to linear:
development %>%
ggplot(aes(log(gdp_per_capita), life_expectancy)) +
geom_point() +
geom_smooth(method=lm, se=F)
And this means that if we use log(gdp_per_capita) as the predictor in an lm model, the model residuals no longer violate the assumptions (or at least, not too badly):
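One way to fit this model and remake the diagnostic plots, following the same augment workflow used above (the model name log_model is just a placeholder):

# fit the model using the log-transformed predictor
log_model <- lm(life_expectancy ~ log(gdp_per_capita), data = development)

# histogram of the residuals
augment(log_model) %>%
  ggplot(aes(.resid)) + geom_histogram()

# scatterplot of fitted values vs. residuals
augment(log_model) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()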
Try it for yourself:
- Use the built-in mtcars dataset
- Make a scatterplot of the mpg (miles per gallon) and hp (horsepower) variables
- Add a fitted line using geom_smooth
- Use a log() transform on the hp variable and make the plot again
- Do either or both of the plots show a linear relationship? Which would be more suitable to use in a regression model?
- If all is going well and you have time, you could try running this model using lm and making the diagnostic plots shown above using augment and ggplot.