A generalised approach
The ‘split, apply, combine’ model
The dplyr::
package, and especially the summarise()
function provides a
generalised way to create dataframes of frequencies and other summary
statistics, grouped and sorted however we like.
Each of the dplyr
‘verbs’ acts on a dataframe in some way, and returns a
dataframe as it’s result. This is convenient because we can chain together the
different verbs to describe exactly the table we want.
For example, let’s say we want the mean of some of our variables across the whole dataframe:
angry.moods %>%
summarise(
mean.anger.out=mean(Anger.Out),
sd.anger.out=sd(Anger.Out)
)
# A tibble: 1 x 2
mean.anger.out sd.anger.out
<dbl> <dbl>
1 16.1 4.22
Here the summarise
function accepts the angry.moods
dataframe as an input,
and has returned a dataframe containing the statistics we need. In this instance
the result dataframe only has one row.
What if we want the numbers for men and women separately?
The key is to think about what we want to achieve, and work out how to describe it. However, in general, we will often want to follow this pattern:
- Split our data (into men and women, or some other categorisation)
- Apply some operation (function) to each group individually (e.g. calculate the mean)
- Combine it into a single table again
It’s helpful to think of this split \(\rightarrow\) apply \(\rightarrow\) combine pattern whenever we are processing data because it makes explicit what we want to do.
Split: breaking the data into groups
The first task is to organise our dataframe into the relevant groups. To do this
we use group_by()
:
angry.moods %>%
group_by(Gender) %>%
glimpse
Observations: 78
Variables: 7
Groups: Gender [2]
$ Gender <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1,…
$ Sports <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…
$ Anger.Out <dbl> 18, 14, 13, 17, 16, 16, 12, 13, 16, 12, 12, 17,…
$ Anger.In <dbl> 13, 17, 14, 24, 17, 22, 12, 16, 16, 16, 13, 23,…
$ Control.Out <dbl> 23, 25, 28, 23, 26, 25, 31, 22, 22, 29, 24, 24,…
$ Control.In <dbl> 20, 24, 28, 23, 28, 23, 27, 31, 24, 29, 25, 22,…
$ Anger.Expression <dbl> 36, 30, 19, 43, 27, 38, 14, 24, 34, 18, 24, 42,…
Weirdly, this doesn’t seem to have done anything. The data aren’t sorted by
Gender
, and there is no visible sign of the grouping, but stick with it… the
grouping is there and the effect will be clearer in a moment.
Apply and combine
Continuing the example above, once we have grouped our data we can then apply
a function to it — for example, we can summarise each group by taking the mean
of the Anger.Out
variable:
angry.moods %>%
group_by(Gender) %>%
summarise(
mean.anger.out=mean(Anger.Out)
)
# A tibble: 2 x 2
Gender mean.anger.out
<dbl> <dbl>
1 1 16.6
2 2 15.8
The combine step happens automatically for us: dplyr
has combined the
summaries of each gender into a single dataframe for us.
In summary, we:
- split the data by
Gender
, usinggroup_by()
- apply the
summarise()
function - combine the results into a new data frame (happens automatically)
A ‘real’ example
Imagine we have raw data from a study which had measured depression with the PHQ-9 scale.
Each patient
was measured on numerous occasions (the month
of observation is
recorded), and were split into treatment group
(0=control, 1=treatment). The
phq9
variable is calculated as the sum of all their questionnaire responses.
phq9.df %>% glimpse
Observations: 2,429
Variables: 4
$ patient <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
$ group <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
$ month <dbl> 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 18, 0, 1, 2…
$ phq9 <dbl> 2.56, 2.33, 1.89, 2.00, 2.44, 2.33, 2.44, 2.11, 2.22, 3.…
If this were our data we might want to:
- Calculate the average PHQ-9 score at each month, and in each group
- Show these means by group for months 0, 7 and 12
We can do this using group_by
and summarise
:
phq9.df %>%
group_by(group) %>%
summarise(average_phq9 = mean(phq9))
# A tibble: 2 x 2
group average_phq9
<dbl> <dbl>
1 0 1.87
2 1 1.60
You can load the PHQ9 data above by typing:
# remeber to load the tidyverse package first
phq9 <- read_csv('data/phq-summary.csv')
Parsed with column specification:
cols(
patient = col_double(),
group = col_double(),
month = col_double(),
phq9 = col_double()
)
Try to edit the code above to:
- Create summary table with the mean at each month
- The mean at each month, in each group
- The mean and SD by month and group
Fancy reshaping
As noted above, it’s common to combine the process of reshaping and aggregating or summarising in the same step.
For example here we have multiple rows per person, 3 trial at time 1, and 3 more trials at time 2:
expt.data %>%
arrange(person, time, trial) %>%
head %>%
pander
Condition | trial | time | person | RT |
---|---|---|---|---|
1 | 1 | 1 | 1 | 284.5 |
1 | 2 | 1 | 1 | 309.3 |
1 | 3 | 1 | 1 | 346.7 |
1 | 1 | 2 | 1 | 368 |
1 | 2 | 2 | 1 | 263.7 |
1 | 3 | 2 | 1 | 220.4 |
We can reshape and aggregate this in a single step using dcast
. Here we
request the mean for each person at each time, with observations for each time
split across two columns:
library(reshape2)
expt.data %>%
dcast(person~paste0('time',time),
fun.aggregate=mean) %>%
head %>%
pander
Using RT as value column: use value.var to override.
person | time1 | time2 |
---|---|---|
1 | 313.5 | 284 |
2 | 252.2 | 263.3 |
3 | 263.1 | 290.4 |
4 | 271.2 | 249.4 |
5 | 274.3 | 329.9 |
6 | 280.3 | 231.7 |
Here dcast
has correctly guessed that RT
is the value we want to aggregate
(you can specify explicitly with the value.var=
parameter).
dcast
knows to aggregate using the mean because we set this with the
agg.function
parameter; this just stands for ‘aggregation function’.
We don’t have to request the mean though: any function will do. Here we request the SD instead:
expt.data %>%
dcast(person~time,
fun.aggregate=sd) %>%
head %>%
pander
Using RT as value column: use value.var to override.
person | 1 | 2 |
---|---|---|
1 | 31.27 | 75.85 |
2 | 15.07 | 33.98 |
3 | 88.71 | 104 |
4 | 40.42 | 3.913 |
5 | 61.34 | 52.7 |
6 | 35.82 | 17.3 |