Joining different datasets

Many analyses combine data from different sources.

Even within a single study, you may find that you have different datafiles ( e.g. spreadsheets) for different sorts of information. For example you might have

  1. Multiple csv files output by experimental software.

  2. A spreadsheet of demographic characteristics of participants (e.g. their age, gender, handedness)

  3. Questionnaire data, collected before the reaction time task.

Your main analysis might want to model individual RTs using predictors including age, handedness or some personality variable. Thus, you probably want data which look something like this:

df %>%
  pander()
person trial condition female left.handed age extraversion RT
1 1 A TRUE FALSE 19 50 251.7
1 2 A TRUE FALSE 19 50 311.1
1 3 A TRUE FALSE 19 50 343.4
1 A TRUE FALSE 19 50 206.2
2 1 B FALSE FALSE 24 34 317.2
2 2 B FALSE FALSE 24 34 320.2
2 3 B FALSE FALSE 24 34 277
2 B FALSE FALSE 24 34 278.1

The raw data

In this example, we could imagine our raw data files looking like this:

rts %>%
  pander()
person trial RT
1 1 251.7
1 2 311.1
1 3 343.4
1 206.2
2 1 317.2
2 2 320.2
2 3 277
2 278.1
demographics %>%
  pander()
person age left.handed female
1 19 FALSE TRUE
2 24 FALSE FALSE
3 33 TRUE FALSE

And:

personality %>%
  pander()
person extraversion
1 50
2 34
3 47

Joining the parts

To create the combined data file we want, we have to join the different files together.

As you might have noticed, though, the RT’s file has many observations per participant, whereas the demographics and personality data has one row per person.

What we need is to smush this wide format data into the RTs file, such that values get repeated for every row of each participants’ data.

To do this, the simplest method is to use the dplyr::left_join() function:

left_join(rts, demographics, by="person") %>%
  pander
person trial RT age left.handed female
1 1 251.7 19 FALSE TRUE
1 2 311.1 19 FALSE TRUE
1 3 343.4 19 FALSE TRUE
1 206.2 19 FALSE TRUE
2 1 317.2 24 FALSE FALSE
2 2 320.2 24 FALSE FALSE
2 3 277 24 FALSE FALSE
2 278.1 24 FALSE FALSE

In this example I explicitly set id="person" to let R know which variable to use to match the rows of data, although you don’t have to, and dplyr can normally guess. You do have to use the same variable name in both files though (so, person and participant couldn’t be matched, for example).

Other types of joins

As the dplyr manual states: left joins return for two dataframes, x and y will return all rows from x, and all columns from both x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Left joins are probably the most useful, but there are other types which can be useful. To see all of them look at the help files (help('join', 'dplyr')).

To show one common example, if we wanted to check whether we were missing RT data for one of our participants, we could use an anti_join:

anti_join(demographics, rts, by="person") %>%
  pander
person age left.handed female
3 33 TRUE FALSE

This table lists the data for all the people in the demographics file, for whom we don’t have RT data. This can be a useful way of checking you haven’t mislaid a raw datafile (e.g. forgot to copy it from the lab machine!).