Factors and variable codings
If you store categorical data as numbers (e.g. groups 1, 2, 3, …), it’s
important to make sure your predictors are entered correctly into your models.
In general, R works in a ‘regressiony’ way and will assume that numeric variables
in a formula are linear predictors. So, a group variable coded 1…4 will be
entered as a single slope parameter where 4 is considered twice as large as 2, etc.
See the example below. In the first model cyl is entered as a ‘linear slope’;
in the second each value of cyl (4, 6, or 8) is treated as a separate category.
The predictions from each model could be very different:
linear.model <- lm(mpg ~ cyl, data=mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data=mtcars)
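As a rough illustration of how the predictions differ, the sketch below refits both models and predicts mpg for cars with 4, 6 and 8 cylinders. The names new.cars and predictions are introduced here for illustration only.

```r
# Refit both models so this example runs on its own
linear.model <- lm(mpg ~ cyl, data = mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data = mtcars)

# Hypothetical new data: one row per cylinder count found in mtcars
new.cars <- data.frame(cyl = c(4, 6, 8))

# Put the two sets of predictions side by side
predictions <- data.frame(
  cyl = new.cars$cyl,
  linear = predict(linear.model, newdata = new.cars),
  categorical = predict(categorical.model, newdata = new.cars)
)
print(predictions)
```

The linear model is forced to space its predictions evenly across 4, 6 and 8 cylinders, whereas the categorical model simply reproduces the mean mpg of each group.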
In the case of different experimental groups, what you would normally want is for
group to be coded and entered as a number of categorical parameters in your
model. The most common way of doing this is to use ‘dummy coding’, and this is
what R will implement by default for character or factor variables.
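To see what dummy coding looks like in practice, the following sketch inspects the coding R applies to a factor version of cyl. The names cyl.factor, dummy.codes and design are chosen for this example; contrasts() and model.matrix() are base R functions.

```r
# Treat the number of cylinders as a category
cyl.factor <- factor(mtcars$cyl)

# The default coding for an unordered factor: the first level (4) is the
# reference, and each other level gets its own 0/1 indicator column
dummy.codes <- contrasts(cyl.factor)
print(dummy.codes)

# The design matrix lm() would build: an intercept plus one
# indicator column for each non-reference level
design <- model.matrix(~ cyl.factor)
print(head(design))
```

Each coefficient for a non-reference level is then interpreted as the difference between that group and the reference group.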
To make sure your categorical variables are entered into your model as categories (and not a slope) you can either:
- Convert the variable to a character or factor type in the dataframe or
- Specify that the variable is a factor when you run the model
For example, here we specify that cyl is a factor within the model formula:
lm(mpg ~ factor(cyl), data=mtcars)
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Coefficients:
 (Intercept)  factor(cyl)6  factor(cyl)8
      26.664        -6.921       -11.564
Whereas here we convert to a factor in the original dataset:
mtcars$cyl.factor <- factor(mtcars$cyl)
lm(mpg ~ cyl.factor, data=mtcars)
Call:
lm(formula = mpg ~ cyl.factor, data = mtcars)
Coefficients:
(Intercept)  cyl.factor6  cyl.factor8
     26.664       -6.921      -11.564
Neither option is universally better, but if you have variables which are definitely factors (i.e. should never be used as slopes), it’s probably better to convert them in the original dataframe before you start modelling.
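One way to convert up front and confirm the result is sketched below. It works on a copy of the data (called cars here, a name assumed for this example) so that the built-in mtcars is left unchanged.

```r
# Work on a copy so the built-in mtcars data is not modified
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# Check the conversion before modelling
print(is.factor(cars$cyl))
print(levels(cars$cyl))

# No factor() call is needed in the formula now
converted.model <- lm(mpg ~ cyl, data = cars)

# For comparison: the same model with the conversion done in the formula
formula.model <- lm(mpg ~ factor(cyl), data = mtcars)

# Both routes should give the same fitted values
same.fit <- all.equal(unname(fitted(converted.model)),
                      unname(fitted(formula.model)))
print(same.fit)
```

Converting once in the dataframe means every later model, plot or summary treats the variable as categorical without any further effort.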