Types of variable

When working with data in Excel or other packages like SPSS you’ve probably become aware that different types of data get treated differently.

For example, in Excel a formula like =SUM(...) can’t meaningfully add up cells which include letters (rather than just numbers). It doesn’t make sense.

However, Excel and many other programs will sometimes make guesses about what to do if you combine different types of data.

For example, in Excel, if you add 28 to 1 Feb 2017 the result is 1 March 2017. This is sometimes what you want, but can often lead to unexpected results and errors in data analyses.

R is much stricter about not mixing types of data. Vectors (or columns in dataframes) can only contain one type of thing. In general, there are four types of data you are likely to encounter in your data analysis:

  • Numeric variables
  • Character variables
  • Factors
  • Dates
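To make the point about strictness concrete, here is a minimal sketch using only base R. It shows what happens if you try to put different types of thing in one vector with c(): R converts everything to a single common type rather than keeping a mixture. The variable names (mixed, result) are made up for illustration.

```r
# Combining a number, some text and a logical value in one vector:
# R converts all of them to one common type (character here)
mixed <- c(1, "a", TRUE)
class(mixed)
mixed

# Adding a number to text is an error, rather than a guess.
# tryCatch() is used here only so the example keeps running.
result <- tryCatch(1 + "a", error = function(e) "error")
result
```

Every element of mixed ends up as text, including the 1 and the TRUE, so it is worth checking the type of a column if numbers and letters may have been mixed when the data were entered.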

The file data/lakers.RDS contains a dataset adapted from the lubridate::lakers dataset (this is a dataset built into an add-on package for R).

The original dataset provides scores and other information from each Los Angeles Lakers basketball game in the 2008-2009 season. Our subset contains four variables which illustrate the common variable types: date, opponent, team, and points.

library(tidyverse)  # provides %>% and glimpse()
lakers <- readRDS("data/lakers.RDS")
lakers %>% 
  glimpse()
Observations: 34,624
Variables: 4
$ date     <date> 2008-10-28, 2008-10-28, 2008-10-28, 2008-10-28, 2008-1…
$ opponent <chr> "POR", "POR", "POR", "POR", "POR", "POR", "POR", "POR",…
$ team     <fct> OFF, LAL, LAL, LAL, LAL, LAL, POR, LAL, LAL, POR, LAL, …
$ points   <int> 0, 0, 0, 0, 0, 2, 0, 1, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 0…

One thing to note here is that the glimpse() command tells us the type of each variable. So we have:

  • date: type date
  • opponent: type chr, short for ‘character’, or alphanumeric data
  • team: type fct, short for factor
  • points: type int, short for integer (i.e. whole numbers)
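You can also check the type of a single column with base R’s class() function. The sketch below assumes a small made-up data frame called example, with the same four column types as lakers. The values are illustrative only and are not taken from the real dataset.

```r
# A tiny, made-up data frame with the same four column types as lakers
example <- data.frame(
  date = as.Date(c("2008-10-28", "2008-10-29")),
  opponent = c("POR", "LAC"),
  team = factor(c("LAL", "POR")),
  points = c(2L, 0L),
  stringsAsFactors = FALSE
)

# class() reports the type of one column
class(example$points)

# sapply() applies class() to each column in turn
types <- sapply(example, class)
types
```

Note that class() spells the names out in full ("character", "factor", "integer", "Date"), whereas glimpse() uses abbreviations.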

Differences in quantity: numeric variables

We’ve already seen numeric variables in the section on vectors and lists. These behave pretty much as you’d expect, and we won’t expand on them here.

There are different types of numeric variable. Integers (whole numbers) are stored as type int, but other types, like dbl (short for ‘double’), can store numbers with a decimal place. For most purposes (in doing analyses of psychological data) the differences won’t matter.
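As a brief illustration of the two numeric types, here is a sketch using base R’s typeof(), which reports how a value is stored. The variable names are made up for the example.

```r
whole <- 5L      # the L suffix asks R to store an integer
decimal <- 5.5   # numbers with a decimal part are stored as doubles
plain <- 5       # without the L suffix, R stores a double by default

typeof(whole)
typeof(decimal)
typeof(plain)

# Mixing the two in arithmetic just works: the result is a double
total <- whole + decimal
typeof(total)
```

Because R converts between the two automatically in arithmetic, you rarely need to think about which one you have.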

Differences in quality or kind

In many cases variables will be used to identify values which are qualitatively different. For example, different groups or measurement occasions in an experimental study, or perhaps different genders or countries in survey data.

In practice, these qualitative differences get stored in a range of different variable types, including:

  • Numeric variables (e.g. time = 1, or time = 2…)
  • Character variables (e.g. time = "time 1", time = "time 2"…)
  • Boolean or logical variables (e.g. time1 == TRUE or time1 == FALSE)
  • ‘Factors’

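Storing categories as numeric variables can produce confusing results when running regression models, because R will treat the numbers as quantities (so that group 2 is ‘twice’ group 1) rather than as labels for separate categories.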

For this reason, it’s normally best to store your categorical variables either as descriptive strings of letters and numbers (e.g. “Treatment”, “Control”) or as a factor, and to avoid simple numbers (e.g. 1, 2, 3).
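To see why numeric codes can be confusing in a regression, here is a minimal sketch using base R’s lm(). The data frame scores and its values are made up for illustration. The same three groups are entered first as a number and then as a category.

```r
# Made-up scores for three groups (illustrative values only)
scores <- data.frame(
  group = rep(c(1, 2, 3), each = 4),
  score = c(10, 12, 11, 13, 20, 22, 21, 23, 12, 14, 13, 15)
)

# With group stored as a number, lm() fits a single straight-line
# slope running through groups 1, 2 and 3
as_number <- lm(score ~ group, data = scores)
coef(as_number)

# Wrapped in factor(), groups 2 and 3 are each compared with group 1
as_category <- lm(score ~ factor(group), data = scores)
coef(as_category)
```

The first model estimates one slope for group, which only makes sense if the groups really are evenly spaced steps along a scale. The second estimates a separate difference from group 1 for each of the other groups, which is usually what is wanted for categorical data.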

Factors for categorical data

Factors are R’s answer to the problem of storing categorical data. Factors assign one number for each unique value in a variable, and allow you to attach a label to it.

This means the categories are stored as numbers ‘under the hood’, but you can also work with factors as though they were strings of letters and numbers, and they display nicely when making tables and graphs.

For example:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

group.factor <- factor(1:10)
group.factor
 [1] 1  2  3  4  5  6  7  8  9  10
Levels: 1 2 3 4 5 6 7 8 9 10

group.labelled <- factor(1:10, labels = paste("Group", 1:10))
group.labelled
 [1] Group 1  Group 2  Group 3  Group 4  Group 5  Group 6  Group 7 
 [8] Group 8  Group 9  Group 10
10 Levels: Group 1 Group 2 Group 3 Group 4 Group 5 Group 6 ... Group 10

We can see this ‘underlying’ number which represents each category by using as.numeric:

# note, there is no guarantee that "Group 1" == 1 (although it is here)
as.numeric(group.labelled)
 [1]  1  2  3  4  5  6  7  8  9 10
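The warning in the comment above matters most when the labels of a factor themselves look like numbers. A short sketch, using a made-up factor called doses:

```r
# A factor whose labels happen to look like numbers
doses <- factor(c("10", "50", "20", "50"))
levels(doses)

# as.numeric() returns the underlying category codes, not the labels
codes <- as.numeric(doses)
codes

# Converting to character first recovers the numbers shown in the labels
values <- as.numeric(as.character(doses))
values
```

Here codes contains the position of each value within levels(doses), whereas values contains the original numbers, so it is safest to convert via as.character() whenever the labels are what you need.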

For simple analyses it’s often best to store everything as the character type (letters and numbers), but factors can still be useful for making tables or graphs where the list of categories is known and needs to be in a particular order. For more about factors, and lots of useful functions for working with them, see the forcats:: package: https://github.com/tidyverse/forcats
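As an illustration of the point about ordering, the sketch below uses base R only (forcats offers more convenient helpers for the same job). The responses vector is made up for the example.

```r
# Made-up survey responses (illustrative only)
responses <- c("medium", "low", "high", "low", "medium", "low")

# By default the levels are sorted alphabetically
default_order <- factor(responses)
levels(default_order)

# Supplying levels = fixes the order used in tables and plots
in_order <- factor(responses, levels = c("low", "medium", "high"))
counts <- table(in_order)
counts
```

With the levels given explicitly, the table lists low, medium and high in their natural order rather than alphabetically.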

Dates

Internally, R stores dates as the number of days since January 1, 1970. This means that we can work with dates just like other numbers, and it makes sense to take the min() or max() of a series of dates:

# the first few dates in the sequence
head(lakers$date)
[1] "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28"
[6] "2008-10-28"

# first and last dates
min(lakers$date)
[1] "2008-10-28"
max(lakers$date)
[1] "2009-04-14"
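To look at the number stored behind a date, you can pass it to as.numeric(). This short sketch recreates the first date in the dataset with as.Date(), so that it runs without the lakers file. The name first_game is made up for the example.

```r
first_game <- as.Date("2008-10-28")

# The number of days between 1 January 1970 and this date
days_since_1970 <- as.numeric(first_game)
days_since_1970

# 1 January 1970 itself is day zero
day_zero <- as.numeric(as.Date("1970-01-01"))
day_zero
```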

Because dates are numbers we can also do arithmetic with them, and R will give us a difference (in this case, in days):

max(lakers$date) - min(lakers$date)
Time difference of 168 days
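The same arithmetic can be tried without the dataset. The sketch below recreates the first and last dates shown above with as.Date(), and then repeats the earlier Excel example of adding 28 days to 1 Feb 2017. The variable names are made up for illustration.

```r
first_game <- as.Date("2008-10-28")
last_game <- as.Date("2009-04-14")

# Subtracting two dates gives a 'difftime' object;
# as.numeric() turns it into a plain number of days
season_length <- last_game - first_game
season_days <- as.numeric(season_length, units = "days")
season_days

# Adding a number to a date moves it forward by that many days
later <- as.Date("2017-02-01") + 28
later
```

Converting the difference to a plain number is useful if you want to use it in further calculations, for example as a predictor in a model.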

However, R does treat dates slightly differently from other numbers, and will format plot axes appropriately, which is helpful (see more on this in the graphics section):

hist(lakers$date, breaks=7)