Factors and variable codings

If you store categorical data as numbers (e.g. groups 1, 2, 3 …) it’s important to make sure your predictors are entered correctly into your models.

In general, R works in a ‘regressiony’ way and will assume numeric variables in a formula are linear predictors. So, a group variable coded 1…4 will be entered as a single slope parameter, where group 4 is treated as twice as large as group 2, and so on.

See below for an example. In the first model cyl is entered as a ‘linear slope’; in the second each value of cyl (4, 6, or 8) is treated as a separate category. The predictions from each model could be very different:

linear.model <- lm(mpg ~ cyl, data=mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data=mtcars)
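
One way to see the difference is to compare the predictions for each cylinder count. The sketch below refits the two models and predicts mpg for cars with 4, 6 and 8 cylinders; the names new.data and predictions are illustrative only:

```r
linear.model <- lm(mpg ~ cyl, data = mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data = mtcars)

# One row per cylinder count that occurs in mtcars
new.data <- data.frame(cyl = c(4, 6, 8))

# Side-by-side predictions from the two codings
predictions <- data.frame(
  cyl = new.data$cyl,
  linear = unname(predict(linear.model, newdata = new.data)),
  categorical = unname(predict(categorical.model, newdata = new.data))
)
print(predictions)

# The linear model estimates 2 parameters; the categorical model estimates 3
print(length(coef(linear.model)))
print(length(coef(categorical.model)))
```

The categorical model reproduces the observed mean mpg in each group, whereas the linear model forces the three predictions onto a straight line.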

For different experimental groups, you would normally want group to be entered as a set of categorical parameters in your model, rather than as a single slope. The most common way of doing this is to use ‘dummy coding’, and this is what R will implement by default for character or factor variables.
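
To see what dummy coding looks like in practice, you can inspect the contrasts R assigns to a factor and the columns it builds for the model. This sketch assumes R's default contrast settings (treatment contrasts for unordered factors); cyl.as.factor, dummy.coding and design.matrix are illustrative names:

```r
cyl.as.factor <- factor(mtcars$cyl)

# Coding scheme: one 0/1 column for each level except the reference level
dummy.coding <- contrasts(cyl.as.factor)
print(dummy.coding)

# The design matrix a model with this factor would use:
# an intercept column plus one 0/1 column per non-reference level
design.matrix <- model.matrix(~ cyl.as.factor)
print(head(design.matrix))
```

The first level (4 cylinders) acts as the reference category, so the coefficients for 6 and 8 cylinders are differences from the 4-cylinder mean.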

To make sure your categorical variables are entered into your model as categories (and not a slope) you can either:

Convert the variable to a character or factor type in the dataframe or
Specify that the variable is a factor when you run the model

For example, here we specify cyl is a factor within the model formula:

lm(mpg ~ factor(cyl), data=mtcars)

Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)

Coefficients:
 (Intercept)  factor(cyl)6  factor(cyl)8  
      26.664        -6.921       -11.564

Whereas here we convert to a factor in the original dataset:

mtcars$cyl.factor <- factor(mtcars$cyl)
lm(mpg ~ cyl.factor, data=mtcars)

Call:
lm(formula = mpg ~ cyl.factor, data = mtcars)

Coefficients:
(Intercept)  cyl.factor6  cyl.factor8  
     26.664       -6.921      -11.564

Neither option is universally better, but if you have variables that are definitely factors (i.e. should never be used as slopes) it’s probably better to convert them in the original dataframe, before you start modelling.
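
If you do convert in the dataframe, it can be worth checking the result before fitting anything. A minimal sketch, working on a copy (mtcars.coded and converted.model are illustrative names) so the original data are left untouched:

```r
mtcars.coded <- mtcars
mtcars.coded$cyl <- factor(mtcars.coded$cyl)

# Confirm the type and the categories R will use
print(is.factor(mtcars.coded$cyl))
print(levels(mtcars.coded$cyl))

# cyl now enters the model as categories, with no factor() needed in the formula
converted.model <- lm(mpg ~ cyl, data = mtcars.coded)
print(coef(converted.model))
```

Overwriting the column like this means cyl can no longer be used as a slope in that dataframe, which is the point when the variable should only ever be treated as a category.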