A generalised approach

The ‘split, apply, combine’ model

The dplyr package, and especially its summarise() function, provides a generalised way to create dataframes of frequencies and other summary statistics, grouped and sorted however we like.

Each of the dplyr ‘verbs’ acts on a dataframe in some way, and returns a dataframe as its result. This is convenient because we can chain the verbs together to describe exactly the table we want.

For example, let’s say we want the mean of some of our variables across the whole dataframe:

angry.moods %>%
  summarise(
    mean.anger.out=mean(Anger.Out),
    sd.anger.out=sd(Anger.Out)
  )
# A tibble: 1 x 2
  mean.anger.out sd.anger.out
           <dbl>        <dbl>
1           16.1         4.22

Here the summarise() function accepts the angry.moods dataframe as input and returns a dataframe containing the statistics we need. In this instance the result has only one row.

What if we want the numbers for men and women separately?

The key is to think about what we want to achieve, and work out how to describe it. However, in general, we will often want to follow this pattern:

  • Split our data (into men and women, or some other categorisation)
  • Apply some operation (function) to each group individually (e.g. calculate the mean)
  • Combine it into a single table again

It’s helpful to think of this split → apply → combine pattern whenever we are processing data, because it makes explicit what we want to do.
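The same pattern can be sketched explicitly in base R. This is a toy example: the dataframe and the names df, group and score are made up for illustration, and don’t come from the data used elsewhere on this page.

```r
# Split -> apply -> combine, spelled out in base R
# (toy dataframe; 'group' and 'score' are made-up names)
df <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 2, 3, 5)
)

pieces <- split(df$score, df$group)             # split: one vector per group
means  <- sapply(pieces, mean)                  # apply: mean of each piece
result <- data.frame(group = names(means),      # combine: back into one dataframe
                     mean.score = unname(means))
result
```

dplyr’s group_by() and summarise() do exactly this, but in a single readable pipeline.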

Split: breaking the data into groups

The first task is to organise our dataframe into the relevant groups. To do this we use group_by():

angry.moods %>%
  group_by(Gender) %>%
  glimpse
Observations: 78
Variables: 7
Groups: Gender [2]
$ Gender           <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1,…
$ Sports           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…
$ Anger.Out        <dbl> 18, 14, 13, 17, 16, 16, 12, 13, 16, 12, 12, 17,…
$ Anger.In         <dbl> 13, 17, 14, 24, 17, 22, 12, 16, 16, 16, 13, 23,…
$ Control.Out      <dbl> 23, 25, 28, 23, 26, 25, 31, 22, 22, 29, 24, 24,…
$ Control.In       <dbl> 20, 24, 28, 23, 28, 23, 27, 31, 24, 29, 25, 22,…
$ Anger.Expression <dbl> 36, 30, 19, 43, 27, 38, 14, 24, 34, 18, 24, 42,…

Weirdly, this doesn’t seem to have done much. The data aren’t sorted by Gender, and apart from the Groups: Gender [2] line in the output nothing visible has changed. Stick with it though: the grouping is stored with the dataframe, and its effect will be clear in a moment.
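If you want to check that the grouping really is there, dplyr provides helpers that report the grouping metadata. A small sketch, using a toy dataframe as a stand-in for angry.moods:

```r
library(dplyr)

# Toy stand-in for angry.moods
toy <- tibble(Gender = c(1, 2, 2, 1), Anger.Out = c(12, 18, 14, 16))
grouped <- toy %>% group_by(Gender)

group_vars(grouped)  # which variables the data are grouped by: "Gender"
n_groups(grouped)    # how many groups were found: 2
```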

Apply and combine

Continuing the example above, once we have grouped our data we can then apply a function to it — for example, we can summarise each group by taking the mean of the Anger.Out variable:

angry.moods %>%
  group_by(Gender) %>%
  summarise(
    mean.anger.out=mean(Anger.Out)
  )
# A tibble: 2 x 2
  Gender mean.anger.out
   <dbl>          <dbl>
1      1           16.6
2      2           15.8

The combine step happens automatically: dplyr has combined the summaries for each gender into a single dataframe for us.

In summary, we:

  • split the data by Gender, using group_by()
  • apply the summarise() function
  • combine the results into a new data frame (happens automatically)
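summarise() isn’t limited to a single statistic: we can compute several per group in one call, including a count with n(). A sketch, again using a toy dataframe in place of angry.moods:

```r
library(dplyr)

# Toy stand-in for angry.moods
toy <- tibble(
  Gender    = c(1, 1, 2, 2, 2),
  Anger.Out = c(16, 12, 18, 14, 13)
)

res <- toy %>%
  group_by(Gender) %>%
  summarise(
    n              = n(),             # how many rows in each group
    mean.anger.out = mean(Anger.Out),
    sd.anger.out   = sd(Anger.Out)
  )
res
```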

A ‘real’ example

Imagine we have raw data from a study which had measured depression with the PHQ-9 scale.

Each patient was measured on numerous occasions (the month of observation is recorded), and patients were assigned to either a control or a treatment group (group: 0 = control, 1 = treatment). The phq9 variable is calculated as the sum of all their questionnaire responses.

phq9.df %>% glimpse
Observations: 2,429
Variables: 4
$ patient <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
$ group   <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
$ month   <dbl> 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 18, 0, 1, 2…
$ phq9    <dbl> 2.56, 2.33, 1.89, 2.00, 2.44, 2.33, 2.44, 2.11, 2.22, 3.

If this were our data we might want to:

  • Calculate the average PHQ-9 score at each month, and in each group
  • Show these means by group for months 0, 7 and 12

We can do this using group_by() and summarise(). As a first step, here is the average for each group, collapsing over months:

phq9.df %>%
  group_by(group) %>%
  summarise(average_phq9 = mean(phq9))
# A tibble: 2 x 2
  group average_phq9
  <dbl>        <dbl>
1     0         1.87
2     1         1.60

You can load the PHQ9 data above by typing:

# remember to load the tidyverse package first
phq9.df <- read_csv('data/phq-summary.csv')
Parsed with column specification:
cols(
  patient = col_double(),
  group = col_double(),
  month = col_double(),
  phq9 = col_double()
)

Try to edit the code above to create:

  • A summary table with the mean at each month
  • The mean at each month, in each group
  • The mean and SD by month and group
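As a hint for the last of these, the general shape is to group by both variables before summarising. A sketch, with a toy dataframe standing in for the real PHQ-9 data:

```r
library(dplyr)

# Toy stand-in for phq9.df
toy <- tibble(
  group = c(0, 0, 1, 1),
  month = c(0, 0, 0, 0),
  phq9  = c(2.0, 2.4, 1.5, 1.7)
)

res <- toy %>%
  group_by(month, group) %>%   # split by both variables at once
  summarise(
    mean.phq9 = mean(phq9),
    sd.phq9   = sd(phq9)
  )
res
```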

Fancy reshaping

As noted above, it’s common to combine the process of reshaping and aggregating or summarising in the same step.

For example, here we have multiple rows per person: three trials at time 1, and three more trials at time 2:

expt.data %>%
  arrange(person, time, trial) %>%
  head %>%
  pander
Condition  trial  time  person     RT
        1      1     1       1  284.5
        1      2     1       1  309.3
        1      3     1       1  346.7
        1      1     2       1    368
        1      2     2       1  263.7
        1      3     2       1  220.4

We can reshape and aggregate this in a single step using dcast. Here we request the mean for each person at each time, with observations for each time split across two columns:

library(reshape2)
expt.data %>%
    dcast(person~paste0('time',time),
          fun.aggregate=mean) %>%
  head %>%
  pander
Using RT as value column: use value.var to override.
person  time1  time2
     1  313.5    284
     2  252.2  263.3
     3  263.1  290.4
     4  271.2  249.4
     5  274.3  329.9
     6  280.3  231.7

Here dcast has correctly guessed that RT is the value we want to aggregate (you can specify explicitly with the value.var= parameter).

dcast knows to aggregate using the mean because we set this with the fun.aggregate parameter; the name just stands for ‘aggregation function’.
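Rather than relying on dcast’s guess, we can also name the value column explicitly. A sketch, with a toy dataframe standing in for expt.data:

```r
library(reshape2)

# Toy stand-in for expt.data
toy <- data.frame(
  person = c(1, 1, 2, 2),
  time   = c(1, 2, 1, 2),
  RT     = c(300, 350, 280, 260)
)

# value.var names the column to aggregate explicitly,
# so dcast no longer needs to guess
res <- dcast(toy,
             person ~ paste0("time", time),
             value.var     = "RT",
             fun.aggregate = mean)
res
```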

We don’t have to request the mean though: any function will do. Here we request the SD instead:

expt.data %>%
    dcast(person~time,
          fun.aggregate=sd) %>%
  head %>%
  pander
Using RT as value column: use value.var to override.
person      1      2
     1  31.27  75.85
     2  15.07  33.98
     3  88.71    104
     4  40.42  3.913
     5  61.34   52.7
     6  35.82   17.3
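If you prefer to stay within the tidyverse, the same reshape-and-aggregate step can be written with group_by(), summarise() and tidyr’s pivot_wider(). This is a sketch, assuming a reasonably recent dplyr and tidyr, with a toy dataframe standing in for expt.data:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for expt.data
toy <- tibble(
  person = c(1, 1, 1, 1, 2, 2),
  time   = c(1, 1, 2, 2, 1, 2),
  RT     = c(280, 320, 260, 300, 310, 290)
)

res <- toy %>%
  group_by(person, time) %>%                           # split
  summarise(mean.RT = mean(RT), .groups = "drop") %>%  # apply
  pivot_wider(names_from   = time,                     # combine: one column per time
              values_from  = mean.RT,
              names_prefix = "time")
res
```

This does in three explicit steps what dcast does in one, which makes each part of the split → apply → combine pattern visible in the code.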