Joining different datasets
Many analyses combine data from different sources.
Even within a single study, you may find that you have different datafiles ( e.g. spreadsheets) for different sorts of information. For example you might have
Multiple csv files output by experimental software.
A spreadsheet of demographic characteristics of participants (e.g. their age, gender, handedness)
Questionnaire data, collected before the reaction time task.
Your main analysis might want to model individual RTs using predictors including age, handedness or some personality variable. Thus, you probably want data which look something like this:
df %>%
pander()
person | trial | condition | female | left.handed | age | extraversion | RT |
---|---|---|---|---|---|---|---|
1 | 1 | A | TRUE | FALSE | 19 | 50 | 251.7 |
1 | 2 | A | TRUE | FALSE | 19 | 50 | 311.1 |
1 | 3 | A | TRUE | FALSE | 19 | 50 | 343.4 |
1 | … | A | TRUE | FALSE | 19 | 50 | 206.2 |
2 | 1 | B | FALSE | FALSE | 24 | 34 | 317.2 |
2 | 2 | B | FALSE | FALSE | 24 | 34 | 320.2 |
2 | 3 | B | FALSE | FALSE | 24 | 34 | 277 |
2 | … | B | FALSE | FALSE | 24 | 34 | 278.1 |
The raw data
In this example, we could imagine our raw data files looking like this:
rts %>%
pander()
person | trial | RT |
---|---|---|
1 | 1 | 251.7 |
1 | 2 | 311.1 |
1 | 3 | 343.4 |
1 | … | 206.2 |
2 | 1 | 317.2 |
2 | 2 | 320.2 |
2 | 3 | 277 |
2 | … | 278.1 |
demographics %>%
pander()
person | age | left.handed | female |
---|---|---|---|
1 | 19 | FALSE | TRUE |
2 | 24 | FALSE | FALSE |
3 | 33 | TRUE | FALSE |
And:
personality %>%
pander()
person | extraversion |
---|---|
1 | 50 |
2 | 34 |
3 | 47 |
Joining the parts
To create the combined data file we want, we have to join the different files together.
As you might have noticed, though, the RT’s file has many observations per participant, whereas the demographics and personality data has one row per person.
What we need is to smush this wide format data into the RTs file, such that values get repeated for every row of each participants’ data.
To do this, the simplest method is to use the dplyr::left_join()
function:
left_join(rts, demographics, by="person") %>%
pander
person | trial | RT | age | left.handed | female |
---|---|---|---|---|---|
1 | 1 | 251.7 | 19 | FALSE | TRUE |
1 | 2 | 311.1 | 19 | FALSE | TRUE |
1 | 3 | 343.4 | 19 | FALSE | TRUE |
1 | … | 206.2 | 19 | FALSE | TRUE |
2 | 1 | 317.2 | 24 | FALSE | FALSE |
2 | 2 | 320.2 | 24 | FALSE | FALSE |
2 | 3 | 277 | 24 | FALSE | FALSE |
2 | … | 278.1 | 24 | FALSE | FALSE |
In this example I explicitly set id="person"
to let R know which variable to
use to match the rows of data, although you don’t have to, and dplyr
can
normally guess. You do have to use the same variable name in both files though
(so, person
and participant
couldn’t be matched, for example).
Other types of joins
As the dplyr manual states: left joins return for two dataframes, x and y will return all rows from x, and all columns from both x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
Left joins are probably the most useful, but there are other types which can be
useful. To see all of them look at the help files (help('join', 'dplyr')
).
To show one common example, if we wanted to check whether we were missing RT
data for one of our participants, we could use an anti_join
:
anti_join(demographics, rts, by="person") %>%
pander
person | age | left.handed | female |
---|---|---|---|
3 | 33 | TRUE | FALSE |
This table lists the data for all the people in the demographics file, for whom we don’t have RT data. This can be a useful way of checking you haven’t mislaid a raw datafile (e.g. forgot to copy it from the lab machine!).