7 Regression
This section assumes most readers will have done an introductory statistics course and had some practice running multiple regression and/or Anova in SPSS or a similar package.
Describing statistical models using formulae
R requires that you are explicit about the statistical model you want to run, but provides a neat, concise way of describing models, called a formula. For multiple regression and simple Anova, the formulas we write map closely onto the underlying linear model. The formula syntax provides shortcuts to quickly describe all the models you are likely to need.
Formulas have two parts: the left hand side and the right hand side, which are separated by the tilde symbol: ~. Here, the tilde just means ‘is predicted by’.
For example, this formula specifies a regression model where height is the outcome, and age and gender are the predictor variables.8
height ~ age + gender
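As an aside, a formula is an ordinary R object: it can be stored in a variable and inspected before being passed to a model-fitting function. The short sketch below uses the same (hypothetical) variable names just to illustrate this:

f <- height ~ age + gender   # nothing is evaluated yet; this is just a description
class(f)      # "formula"
all.vars(f)   # "height" "age" "gender"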
There are lots more useful tricks to learn when writing formulas, which are covered below. But in the interests of instant gratification, let’s work through a simple example first:
Running a linear model
Linear models (including Anova and multiple regression) are run using the lm(...) function, short for ‘linear model’. We will use the mtcars dataset, which is built into R, for our first example.
First, we have a quick look at the data. The pairs plot suggests that mpg might be related to a number of the other variables, including disp (engine size) and wt (car weight):
library(tidyverse)  # provides select() and the %>% pipe used below

mtcars %>%
  select(mpg, disp, wt) %>%
  pairs()
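If you prefer numbers to a plot, a correlation matrix tells a similar story. This is just an optional check (a quick sketch, assuming the tidyverse is loaded as above):

mtcars %>%
  select(mpg, disp, wt) %>%
  cor() %>%          # pairwise correlations between the three variables
  round(2)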
Before running any model, we should ask ourselves: “what question are we trying to answer?”
In this instance, we can see that both weight (wt) and engine size (disp) are related to mpg, but they are also correlated with one another. We might want to know, then: “are weight and engine size independent predictors of mpg?” That is, if we know a car’s weight, do we gain additional information about its mpg by measuring engine size?
To answer this, we could use multiple regression, including both wt and disp as predictors of mpg. The formula for this model would be mpg ~ wt + disp. The command below runs the model:
lm(mpg ~ wt + disp, data=mtcars)
Call:
lm(formula = mpg ~ wt + disp, data = mtcars)
Coefficients:
(Intercept)           wt         disp
   34.96055     -3.35083     -0.01772
For readers used to wading through reams of SPSS output, R might seem concise to the point of rudeness. By default, the lm command displays very little, only repeating the formula and listing the coefficients for each predictor in the model.
So what next? Unlike SPSS, we must be explicit and tell R exactly what we want. The most convenient way to do this is to first store the results of the lm() function:
m.1 <- lm(mpg ~ wt + disp, data=mtcars)
This stores the results of the lm() function in a variable named m.1. As an aside, this is a pretty terrible variable name: try to give descriptive names to your variables, because this will prevent errors and make your code easier to read.
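For example, a more descriptive name for the same model might look like this (mpg_by_wt_disp is just an illustrative choice, not a convention you must follow):

# same model as m.1, but the name says what it is
mpg_by_wt_disp <- lm(mpg ~ wt + disp, data = mtcars)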
We can then use other functions to get more information about the model. For example:
summary(m.1)
Call:
lm(formula = mpg ~ wt + disp, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-3.4087 -2.3243 -0.7683  1.7721  6.3484

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.96055    2.16454  16.151 4.91e-16 ***
wt           -3.35082    1.16413  -2.878  0.00743 **
disp         -0.01773    0.00919  -1.929  0.06362 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.917 on 29 degrees of freedom
Multiple R-squared:  0.7809,  Adjusted R-squared:  0.7658
F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10
Although still compact, the summary function provides some familiar output, including the estimate, SE, and p value for each parameter.
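If you only want specific pieces of this information, base R also provides extractor functions that work on a stored model. A quick sketch:

coef(m.1)      # just the model coefficients
confint(m.1)   # 95% confidence intervals for each coefficient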
Take a moment to find the following statistics in the output above:
- The coefficients and p values for each predictor
- The R2 for the overall model. What % of variance in mpg is explained?
- Answer the original question: ‘accounting for weight (wt), does engine size (disp) tell us anything extra about a car’s mpg?’
More on formulas
Above we briefly introduced R’s formula syntax. Formulas for linear models have the following structure:
left_hand_side ~ right_hand_side
For linear models the left hand side is our outcome, which must be a continuous variable. For categorical or binary outcomes you need to use the glm() function rather than lm(). See the section on generalised linear models for more details.
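As a rough sketch of what that looks like (using am from mtcars, which happens to be coded 0/1, purely as an illustration):

# logistic regression: binary outcome, so glm() with a binomial family
glm(am ~ wt, data = mtcars, family = binomial)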
The right hand side of the formula lists our predictors. In the example above we used the + symbol to separate the predictors wt and disp. This told R to simply add each predictor to the model. However, many times we want to specify relationships between our predictors, as well as between predictors and outcomes.
For example, we might have an experiment with 2 categorical predictors, each with 2 levels — that is, a 2x2 between-subjects design.
Below, we define and run a linear model with both vs and am as predictors, along with the interaction of vs:am. We save this model as m.2, and use the summary command to print the coefficients.
m.2 <- lm(mpg ~ vs + am + vs:am, data=mtcars)
summary(m.2)
Call:
lm(formula = mpg ~ vs + am + vs:am, data = mtcars)
Residuals:
   Min     1Q Median     3Q    Max
-6.971 -1.973  0.300  2.036  6.250

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.050      1.002  15.017 6.34e-15 ***
vs             5.693      1.651   3.448   0.0018 **
am             4.700      1.736   2.708   0.0114 *
vs:am          2.929      2.541   1.153   0.2589
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.472 on 28 degrees of freedom
Multiple R-squared:  0.7003,  Adjusted R-squared:  0.6682
F-statistic: 21.81 on 3 and 28 DF,  p-value: 1.735e-07
We’d normally want to see the Anova table for this model, including the F-tests:
car::Anova(m.2)
Anova Table (Type II tests)

Response: mpg
          Sum Sq Df F value    Pr(>F)
vs        367.41  1 30.4836 6.687e-06 ***
am        276.03  1 22.9021 4.984e-05 ***
vs:am      16.01  1  1.3283    0.2589
Residuals 337.48 28
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But before you do too much with Anova in R, read this section.
Other formula shortcuts
In addition to the + symbol, we can use other shortcuts to create linear models.
As seen above, the colon (:) operator indicates the interaction between two terms. So a:b is equivalent to creating a new variable in the data frame where a is multiplied by b.
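For numeric predictors you can check this equivalence yourself. In the sketch below (using mtcars purely as an illustration), the two models give the same fit; I() tells R to treat * as ordinary multiplication rather than formula notation:

coef(lm(mpg ~ wt + disp + wt:disp, data = mtcars))
coef(lm(mpg ~ wt + disp + I(wt * disp), data = mtcars))  # same estimates, different term label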
The * symbol indicates the expansion of other terms in the model. So, a*b is the equivalent of a + b + a:b.
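So, for example, these two formulas describe the same model (again using mtcars only as an illustration):

coef(lm(mpg ~ wt * am, data = mtcars))
coef(lm(mpg ~ wt + am + wt:am, data = mtcars))  # identical coefficients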
Finally, it’s good to know that other functions can be used within R formulas to
save work. For example, if you wanted to transform your dependent variable then
log(y) ~ x
will do what you might expect, and saves creating temporary
variables in your dataset.
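For instance, a quick sketch using mtcars:

lm(log(mpg) ~ wt, data = mtcars)  # model the log of mpg directly, no new column needed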
The formula syntax is very powerful, and the above only shows the basics, but you can read the formula help pages in RStudio (e.g. by typing ?formula) for more details.
Run the following models using the mtcars dataset:
- With mpg as the outcome, and with cyl and hp as predictors
- As above, but adding the interaction of cyl and hp
- Repeat the model above, but write the formula a different way (make the formula either more or less explicit, but retaining the same predictors in the model)
8. I avoid the terms dependent/independent variables because they are confusing to many students, and because they are misleading when discussing non-experimental data.