# 16 Making predictions

Objectives of this section:

- Distingish predicted means (predictions) from predicted effects (‘margins’)
- Calculate both predictions and marginal effects for a
`lm()`

- Plot predictions and margins
- Think about how to plot effects in meaningful ways

### Predictions vs margins

Before we start, let’s consider what we’re trying to achieve in making predictions from our models. We need to make a distinction between:

- Predicted means
- Predicted effects or
*marginal effects*

Consider the example used in a previous section where we measured
`injury.severity`

after road accidents, plus two predictor variables: `gender`

and `age`

.

### Predicted means

‘Predicted means’ (or predictions) refers to our best estimate for each category
of person we’re interested in. For example, if `age`

were categorical (
i.e. young vs. older people) then might have 4 predictions to calculate from our
model:

Age | Gender | mean |
---|---|---|

Young | Male | ? |

Old | Male | ? |

Young | Female | ? |

Old | Female | ? |

And as before, we might plot these data:

This plot uses the raw data, but these points could equally have been estimated from a statistical model which adjusted for other predictors.

*Effects* (margins)

Terms like: *predicted effects*, *margins* or *marginal effects* refer, instead,
to the effect of one predictor.

There may be more than one marginal effect because *the effect of one predictor
can change across the range of another predictor*.

Extending the example above, if we take the difference between men and women for each category of age, we can plot these differences. The steps we need to go through are:

- Reshape the data to be wide, including a separate column for injury scores for men and women
- Subtract the score for men from that of women, to calculate the effect of being female
- Plot this difference score

```
margins.plot <- inter.df %>%
# reshape the data to a wider format
reshape2::dcast(older~female) %>%
# calculate the difference between men and women for each age
mutate(effect.of.female = Female - Male) %>%
# plot the difference
ggplot(aes(older, effect.of.female, group=1)) +
geom_point() +
geom_line() +
ylab("Effect of being female") + xlab("") +
geom_hline(yintercept = 0)
margins.plot
```

As before, these differences use the raw data, but *could* have been calculated
from a statistical model. In the section below we do this, making predictions
for means and marginal effects from a `lm()`

.

### Continuous predictors

In the examples above, our data were all categorical, which mean that it was straightforward to identify categories of people for whom we might want to make a prediction (i.e. young men, young women, older men, older women).

However, `age`

is typically measured as a continuous variable, and we would want
to use a grouped scatter plot to see this:

```
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(size=1) +
scale_color_discrete(name="")
```

But to make predictions from this continuous data we need to fit a line through
the points (i.e. run a model). We can do this graphically by calling
`geom_smooth()`

which attempts to fit a smooth line through the data we observe:

```
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(alpha=.2, size=1) +
geom_smooth(se=F)+
scale_color_discrete(name="")
```

And if we are confident that the relationships between predictor and outcome are
sufficiently *linear*, then we can ask ggplot to fit a straight line using
linear regression:

```
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(alpha = .1, size = 1) +
geom_smooth(se = F, linetype="dashed") +
geom_smooth(method = "lm", se = F) +
scale_color_discrete(name="")
```

What these plots illustrate is the steps a researcher might take *before*
fitting a regression model. The straight lines in the final plot represent our
best guess for a person of a given age and gender, assuming a linear regression.

We can read from these lines to make a point prediction for men and women of a specific age, and use the information about our uncertainty in the prediction, captured by the model, to estimate the likely error.

To make our findings simpler to communicate, we might want to make estimates at specific ages and plot these. These ages could be:

- Values with biological or cultural meaning: for example 18 (new driver) v.s. 65 (retirement age)
- Statistical convention (e.g. median, 25th, and 75th centile, or mean +/- 1 SD)

We’ll see examples of both below.