8 Anova

Be sure to read the section on linear models in R before you read this section, and specifically the parts on specifying models with formulae.

This section gives a high-level overview of how to specify Anova models in R, and of some of the issues in interpreting the model output. If you need to revise the basic idea of an Anova, see the Howell textbook [@howell2016fundamental]. For a very quick reminder, this interactive/animated explanation of Anova is helpful.

If you just want the ‘answers’ (i.e. the syntax to specify common Anova models) you could skip to the next section: Anova cookbook.

There are 4 rules for doing Anova in R and not wanting to cry:

  1. Keep your data in ‘long’ format.
  2. Know the differences between character, factor and numeric variables.
  3. Do not use the aov() or anova() functions to get an Anova table unless you know what you are doing.
  4. Learn about the types of sums of squares and always remember to specify type=3, unless you know better.

Rules for using Anova in R

Rule 1: Use long format data

In R, data are almost always most useful in a long format where:

  • each row of the dataframe corresponds to a single measurement occasion
  • each column corresponds to a variable which is measured

For example, in R we will have data like this:

df %>%
  head %>%
  pander
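| person | time | predictor | outcome |
|--------|------|-----------|---------|
| 1      | 1    | 2         | 7       |
| 1      | 2    | 2         | 6       |
| 1      | 3    | 2         | 17      |
| 2      | 1    | 3         | 9       |
| 2      | 2    | 3         | 8       |
| 2      | 3    | 3         | 9       |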

Whereas in SPSS we might have the same data structured like this:

df.wide %>%
    head %>%
    pander
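| person | predictor | Time 1 | Time 2 | Time 3 |
|--------|-----------|--------|--------|--------|
| 1      | 2         | 7      | 6      | 17     |
| 2      | 3         | 9      | 8      | 9      |
| 3      | 2         | 7      | 8      | 10     |
| 4      | 2         | 12     | 10     | 6      |
| 5      | 4         | 11     | 13     | 15     |
| 6      | 4         | 8      | 9      | 12     |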

R always uses long form data when running an Anova, but one downside is that it therefore has no automatic way to know which rows belong to which person (assuming individual people are the unit of error in your model). This means that for repeated measures designs you need to make explicit which measures are repeated when specifying the model (see the section on repeated designs below).
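As a rough sketch of moving between the two layouts, the wide data shown above can be reshaped to long format with base R's `reshape()`. The column names `time1` to `time3` are assumptions standing in for the "Time 1" to "Time 3" columns in the table; `tidyr::pivot_longer()` is a common alternative if you prefer the tidyverse.

```r
# Wide data as in the table above; time1..time3 are assumed column names
df.wide <- data.frame(
  person    = 1:6,
  predictor = c(2, 3, 2, 2, 4, 4),
  time1     = c(7, 9, 7, 12, 11, 8),
  time2     = c(6, 8, 8, 10, 13, 9),
  time3     = c(17, 9, 10, 6, 15, 12)
)

# Reshape so there is one row per person per measurement occasion
df.long <- reshape(
  df.wide,
  direction = "long",
  varying   = c("time1", "time2", "time3"),  # the repeated columns
  v.names   = "outcome",                     # name for the stacked values
  timevar   = "time",                        # name for the occasion index
  times     = 1:3,
  idvar     = "person"
)

# Sort so that each person's rows sit together, and tidy the row names
df.long <- df.long[order(df.long$person, df.long$time), ]
rownames(df.long) <- NULL
head(df.long)
```

Note that the `person` column is what lets you tell a model which rows belong together; R will not infer this from the row order.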

Rule 2: Know your variables

See the section on dataframes and on the different column types and be sure you can distinguish:

  • Numeric variables
  • Factors
  • Character strings.

In Anova:

  • Outcomes will be numeric variables
  • Predictors will be factors or (preferably) character strings

If you want to run Ancova models, you can also add numeric predictors.
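The distinction matters because R decides how to treat a predictor from its column type. The sketch below uses made-up data (the names `d`, `group` and `outcome` are hypothetical) to show that a grouping variable stored as numbers is fitted as a single slope, whereas the same variable converted to a factor gets one parameter per group contrast.

```r
# Hypothetical data: three groups coded as the numbers 1, 2, 3
d <- data.frame(
  group   = rep(c(1, 2, 3), each = 4),
  outcome = c(5, 6, 7, 6, 9, 8, 10, 9, 6, 7, 5, 6)
)
str(d)  # check the column types before modelling

# group is numeric here, so lm() fits a straight line through the codes
m_numeric <- lm(outcome ~ group, data = d)

# Convert to a factor so that lm() compares the three groups
d$group_f <- factor(d$group)
m_factor <- lm(outcome ~ group_f, data = d)

length(coef(m_numeric))  # intercept plus one slope
length(coef(m_factor))   # intercept plus two group contrasts
```

If the second model is the one you intended, the first would silently have tested a different hypothesis, so it is worth running `str()` on your dataframe before fitting anything.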

Rule 3: Don’t use aov() or anova()

This is the most important rule of all.

The aov and anova functions have been around in R a long time. For various historical reasons the defaults for these functions won’t do what you expect if you are used to SPSS, Stata, SAS, and most other stats packages. These differences are important and will be confusing and give you misleading results unless you understand them.

The recommendation here is:

  • If you have a factorial experiment define your model using lm() and then use car::Anova() to calculate F tests.

  • If you have repeated measures, your data are perfectly balanced, and you have no missing values then use afex::aov_car().

  • If you think you want a repeated measures Anova but your data are not balanced, or you have missing data, use linear mixed models instead via the lme4 package.
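For the first (factorial) case, a minimal sketch might look like the following. The data are simulated and the names `d`, `a`, `b` and `y` are hypothetical. Sum-to-zero contrasts are set because type 3 tests from `car::Anova()` are generally only meaningful with them, and the `requireNamespace()` guard is only there so the sketch still runs if car is not installed.

```r
set.seed(1)

# Hypothetical 2 x 3 between-subjects design with 5 observations per cell
d <- expand.grid(a = c("a1", "a2"), b = c("b1", "b2", "b3"), rep = 1:5)
d$y <- rnorm(nrow(d), mean = 10)

# Define the model with lm(), using sum-to-zero contrasts for both factors
model <- lm(
  y ~ a * b,
  data = d,
  contrasts = list(a = "contr.sum", b = "contr.sum")
)

# Ask car::Anova() for the F tests, requesting type 3 sums of squares
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(model, type = 3))
}
```

Because the model itself is an ordinary `lm` object, you can also pass it to `summary()`, `predict()` and the usual diagnostic plots.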

Rule 4: Use type 3 sums of squares (and learn why)

As you may be aware, there are at least 3 different ways of calculating the sums of squares for each factor and interaction in an Anova. In short,

  • SPSS and most other packages use type 3 sums of squares.
  • aov and anova use type 1.
  • By default, car::Anova and ez::ezANOVA use type 2, but can use type 3 if you ask.

This means you must:

  • Make sure you use type 3 sums of squares unless you have a reason not to.
  • Always pass type=3 as an argument when running an Anova.
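To see why the type matters, the sketch below uses a small made-up unbalanced dataset (the names `d`, `a`, `b` and `y` are hypothetical). Here `anova()` is used only to illustrate the problem: because type 1 sums of squares are sequential, the sum of squares attributed to `a` changes depending on whether it is entered before or after `b`.

```r
# Hypothetical unbalanced data: cell sizes differ across a and b
d <- data.frame(
  a = factor(c("a1", "a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2")),
  b = factor(c("b1", "b1", "b1", "b1", "b2", "b1", "b2", "b2", "b2")),
  y = c(4, 5, 6, 5, 9, 6, 10, 11, 9)
)
table(d$a, d$b)  # unequal cell counts

# Type 1 (sequential) sums of squares depend on the order of the terms
ss_a_first <- anova(lm(y ~ a + b, data = d))["a", "Sum Sq"]
ss_a_last  <- anova(lm(y ~ b + a, data = d))["a", "Sum Sq"]

ss_a_first  # a entered first: ignores b
ss_a_last   # a entered last: adjusted for b
```

With perfectly balanced data the two values would coincide, which is why the distinction is easy to overlook until your cell sizes differ.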

A longer explanation of why you probably want type 3 sums of squares is given in this online discussion on stats.stackexchange.com, and the practical implications are shown in this worked example.

An even longer answer, including a much deeper exploration of the philosophical questions involved, is given by @venables1998exegeses.

Recommendations for doing Anova

  1. Make sure to plot your raw data first.

  2. Where you have interactions, be cautious in interpreting the main effects in your model, and always plot the model predictions.

  3. If you find yourself aggregating (averaging) data before running your model, think about using a mixed or multilevel model instead.

  4. If you are using repeated measures Anova, check if you should be using a mixed model instead. If you have an unbalanced design or any missing data, you probably should use a mixed model.
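The first two recommendations can be sketched with base R graphics alone. The data here are simulated and the names `d`, `drug`, `dose` and `y` are hypothetical; this is one possible approach rather than the only way to plot raw data or model predictions.

```r
set.seed(42)

# Hypothetical 2 x 2 design in which only one cell differs (an interaction)
d <- expand.grid(drug = c("placebo", "active"), dose = c("low", "high"), rep = 1:4)
d$y <- 10 + 2 * (d$drug == "active" & d$dose == "high") + rnorm(nrow(d))

# 1. Plot the raw data first: one strip of points per design cell
d$cell <- interaction(d$drug, d$dose)
stripchart(y ~ cell, data = d, vertical = TRUE, pch = 1)

# 2. Fit the model, then predict for every combination of the factors
model <- lm(y ~ drug * dose, data = d)
cells <- expand.grid(drug = levels(d$drug), dose = levels(d$dose))
cells$predicted <- predict(model, newdata = cells)
cells

# Plot the predictions: non-parallel lines suggest an interaction
interaction.plot(cells$drug, cells$dose, cells$predicted,
                 xlab = "drug", trace.label = "dose", ylab = "predicted y")
```

When the lines in the second plot are far from parallel, a single main effect for `drug` or `dose` summarises the data poorly, which is why the predictions are worth plotting before you interpret the Anova table.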