# 8 Anova

Be sure to read the section on linear models in R
*before* you read this section, and specifically the parts on
specifying models with formulae.

This section gives a high-level overview of how to specify Anova models in R and of some of the issues in interpreting the model output. If you need to revise the basic idea of an Anova, see the Howell textbook [@howell2016fundamental]. For a very quick reminder, this interactive/animated explanation of Anova is helpful.

If you just want the ‘answers’ (i.e. the syntax to specify common Anova models), you could skip to the next section: Anova cookbook.

There are 4 rules for doing Anova in R and not wanting to cry:

- Keep your data in ‘long’ format.
- Know the differences between character, factor and numeric variables.
- Do not use the `aov()` or `anova()` functions to get an Anova table unless you know what you are doing.
- Learn about the types of sums of squares and always remember to specify `type=3`, unless you know better.

### Rules for using Anova in R

#### Rule 1: Use long format data

In R, data are almost always most useful in a long format where:

- each row of the dataframe corresponds to a single measurement occasion
- each column corresponds to a variable which is measured

For example, in R we will have data like this:

```
df %>%
head %>%
pander
```

| person | time | predictor | outcome |
|---|---|---|---|
| 1 | 1 | 2 | 7 |
| 1 | 2 | 2 | 6 |
| 1 | 3 | 2 | 17 |
| 2 | 1 | 3 | 9 |
| 2 | 2 | 3 | 8 |
| 2 | 3 | 3 | 9 |
Whereas in SPSS we might have the same data structured like this:

```
df.wide %>%
head %>%
pander
```

| person | predictor | Time 1 | Time 2 | Time 3 |
|---|---|---|---|---|
| 1 | 2 | 7 | 6 | 17 |
| 2 | 3 | 9 | 8 | 9 |
| 3 | 2 | 7 | 8 | 10 |
| 4 | 2 | 12 | 10 | 6 |
| 5 | 4 | 11 | 13 | 15 |
| 6 | 4 | 8 | 9 | 12 |

R always uses long form data when running an Anova, but one downside is that it therefore has no automatic way to know which rows belong to which person (assuming individual people are the unit of error in your model). This means that for repeated measures designs you need to make explicit which measures are repeated when specifying the model (see the section on repeated designs below).
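As a rough sketch of moving between the two layouts, the wide table above can be typed in by hand and reshaped to long format. Using `tidyr::pivot_longer()` is an assumption made for illustration (base R's `reshape()` would also work), and `df.long` is a hypothetical name rather than an object from the original data files.

```r
library(tidyr)

# Wide data, typed in from the table above (not loaded from the original file)
df.wide <- data.frame(
  person    = 1:6,
  predictor = c(2, 3, 2, 2, 4, 4),
  `Time 1`  = c(7, 9, 7, 12, 11, 8),
  `Time 2`  = c(6, 8, 8, 10, 13, 9),
  `Time 3`  = c(17, 9, 10, 6, 15, 12),
  check.names = FALSE  # keep the spaces in the column names
)

# One row per person per measurement occasion
df.long <- pivot_longer(
  df.wide,
  cols         = starts_with("Time "),
  names_to     = "time",
  names_prefix = "Time ",
  values_to    = "outcome"
)

# The time labels arrive as text; store them as integers
df.long$time <- as.integer(df.long$time)

head(df.long)
```

Each person now contributes three rows, and the `person` column is the only thing recording which rows belong together.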

#### Rule 2: Know your variables

See the section on dataframes and on the different column types and be sure you can distinguish:

- Numeric variables
- Factors
- Character strings.

In Anova:

- Outcomes will be numeric variables
- Predictors will be factors or (preferably) character strings

If you want to run Ancova models, you can also add numeric predictors.
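The distinction matters because R decides how to treat a predictor from its column type. The sketch below uses made-up data (the data frame `d` and its values are hypothetical) in which group membership is stored as the numbers 1 to 3. Left as numeric, the variable is fitted as a single slope; converted to a factor, it is fitted as a categorical predictor.

```r
# Hypothetical data: 'group' is stored as numeric codes 1, 2, 3
d <- data.frame(
  group   = rep(c(1, 2, 3), each = 4),
  outcome = c(7, 6, 9, 8, 9, 12, 10, 11, 13, 15, 12, 14)
)

class(d$group)  # numeric, so lm() would treat it as a covariate

# Numeric predictor: an intercept plus one slope
m.numeric <- lm(outcome ~ group, data = d)

# Factor predictor: an intercept plus one parameter per non-reference group
d$group.f <- factor(d$group)
m.factor <- lm(outcome ~ group.f, data = d)

length(coef(m.numeric))
length(coef(m.factor))
```

If the number of coefficients is not what you expect for an Anova, checking the column types with `class()` or `str()` is a sensible first step.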

#### Rule 3: Don’t use `aov()` or `anova()`

This is the most important rule of all.

The `aov` and `anova` functions have been around in R a long time. For various
historical reasons the defaults for these functions won’t do what you expect if
you are used to SPSS, Stata, SAS, and most other stats packages. These
differences are important and will be confusing and give you misleading results
unless you understand them.

The recommendation here is:

- If you have a factorial experiment, define your model using `lm()` and then use `car::Anova()` to calculate F tests.
- If you have repeated measures, your data are perfectly balanced, and you have no missing values, then use `afex::aov_car()`.
- If you think you want a repeated measures Anova but your data are not balanced, or you have missing data, use linear mixed models instead via the `lme4` package.
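A minimal sketch of the first recommendation, using the `ToothGrowth` data that ships with R (chosen here only for illustration; it is not a dataset used elsewhere in this section). Setting sum-to-zero contrasts in the `lm()` call is an added assumption: without them, type 3 tests of main effects in a model with interactions are hard to interpret.

```r
library(car)

# ToothGrowth: tooth length by supplement type (supp) and dose.
# dose is stored as a number, so convert it to a factor first.
tooth <- ToothGrowth
tooth$dose <- factor(tooth$dose)

# Define the factorial model with lm(), using sum-to-zero contrasts
m <- lm(
  len ~ supp * dose,
  data = tooth,
  contrasts = list(supp = contr.sum, dose = contr.sum)
)

# F tests with type 3 sums of squares
tooth.anova <- car::Anova(m, type = 3)
tooth.anova
```

The resulting table has one row per main effect and interaction, plus rows for the intercept and the residuals.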

#### Rule 4: Use type 3 sums of squares (and learn why)

As you may be aware, there are at least 3 different ways of calculating the sums of squares for each factor and interaction in an Anova. In short,

- SPSS and most other packages use type 3 sums of squares.
- `aov` and `anova` use type 1.
- By default, `car::Anova` and `ez::ezANOVA` use type 2, but can use type 3 if you ask.

This means you must:

- Make sure you use type 3 sums of squares unless you have a reason not to.
- Always pass `type=3` as an argument when running an Anova.
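To see why the choice matters, the sketch below fits the same two-factor model with the terms in two different orders. It uses the built-in `warpbreaks` data with a few rows dropped to make the design unbalanced; the choice of dataset and of which rows to drop is arbitrary and purely illustrative. With type 1 (sequential) sums of squares the result for `wool` depends on whether it is entered first or second, whereas with type 3 it does not.

```r
library(car)

# warpbreaks ships with R; drop five rows so the cell sizes are unequal
wb <- warpbreaks[-(1:5), ]
sum.contrasts <- list(wool = contr.sum, tension = contr.sum)

# Same model, terms entered in a different order
m.ab <- lm(breaks ~ wool * tension, data = wb, contrasts = sum.contrasts)
m.ba <- lm(breaks ~ tension * wool, data = wb, contrasts = sum.contrasts)

# Type 1 via anova(): the sum of squares for wool changes with term order
ss1.ab <- anova(m.ab)["wool", "Sum Sq"]
ss1.ba <- anova(m.ba)["wool", "Sum Sq"]

# Type 3 via car::Anova(): the same whichever order is used
ss3.ab <- car::Anova(m.ab, type = 3)["wool", "Sum Sq"]
ss3.ba <- car::Anova(m.ba, type = 3)["wool", "Sum Sq"]

c(type1.ab = ss1.ab, type1.ba = ss1.ba, type3.ab = ss3.ab, type3.ba = ss3.ba)
```

In a perfectly balanced design the types agree, which is why the difference often goes unnoticed until data are missing.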

A longer explanation of *why* you probably want type 3 sums of squares is given
in this online discussion on stats.stackexchange.com,
and practical implications are shown in this worked example.

An even longer answer, including a much deeper exploration of the philosophical questions involved, is given by @venables1998exegeses.

### Recommendations for doing Anova

- Make sure to plot your raw data *first*.
- Where you have interactions, be cautious in interpreting the main effects in your model, and always plot the model predictions.
- If you find yourself aggregating (averaging) data before running your model, think about using a mixed or multilevel model instead.
- If you are using repeated measures Anova, check whether you should be using a mixed model instead. If you have an unbalanced design or any missing data, you probably should use a mixed model.
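As a hedged illustration of the mixed-model alternative, the sketch below uses the `sleepstudy` data that ships with `lme4`, with a few rows removed to mimic missing observations. The dataset, the rows removed, and the random-intercept-only structure are all assumptions made for the example, not a recommended specification for any particular design.

```r
library(lme4)

# sleepstudy: reaction times for 18 subjects measured on 10 days.
# Remove a few rows to mimic missing observations (illustrative only).
sleep <- sleepstudy[-c(3, 17, 58, 101), ]
sleep$Days <- factor(sleep$Days)  # treat day as a categorical predictor

# The random intercept tells the model which rows belong to which subject
m.mixed <- lmer(Reaction ~ Days + (1 | Subject), data = sleep)

nobs(m.mixed)   # all remaining rows are used, despite the gaps
ngrps(m.mixed)  # number of subjects contributing data
```

Subjects with incomplete data still contribute the observations they do have, rather than being dropped entirely.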