Joining different datasets

Many analyses combine data from different sources.

Even within a single study, you may find that you have different datafiles ( e.g. spreadsheets) for different sorts of information. For example you might have

Multiple csv files output by experimental software.
A spreadsheet of demographic characteristics of participants (e.g. their age, gender, handedness)
Questionnaire data, collected before the reaction time task.

Your main analysis might want to model individual RTs using predictors including age, handedness or some personality variable. Thus, you probably want data which look something like this:

df %>%
  pander()

person	trial	condition	female	left.handed	age	extraversion	RT
1	1	A	TRUE	FALSE	19	50	251.7
1	2	A	TRUE	FALSE	19	50	311.1
1	3	A	TRUE	FALSE	19	50	343.4
1	…	A	TRUE	FALSE	19	50	206.2
2	1	B	FALSE	FALSE	24	34	317.2
2	2	B	FALSE	FALSE	24	34	320.2
2	3	B	FALSE	FALSE	24	34	277
2	…	B	FALSE	FALSE	24	34	278.1

The raw data

In this example, we could imagine our raw data files looking like this:

rts %>%
  pander()

person	trial	RT
1	1	251.7
1	2	311.1
1	3	343.4
1	…	206.2
2	1	317.2
2	2	320.2
2	3	277
2	…	278.1

demographics %>%
  pander()

person	age	left.handed	female
1	19	FALSE	TRUE
2	24	FALSE	FALSE
3	33	TRUE	FALSE

And:

personality %>%
  pander()

person	extraversion
1	50
2	34
3	47

Joining the parts

To create the combined data file we want, we have to join the different files together.

As you might have noticed, though, the RT’s file has many observations per participant, whereas the demographics and personality data has one row per person.

What we need is to smush this wide format data into the RTs file, such that values get repeated for every row of each participants’ data.

To do this, the simplest method is to use the dplyr::left_join() function:

left_join(rts, demographics, by="person") %>%
  pander

person	trial	RT	age	left.handed	female
1	1	251.7	19	FALSE	TRUE
1	2	311.1	19	FALSE	TRUE
1	3	343.4	19	FALSE	TRUE
1	…	206.2	19	FALSE	TRUE
2	1	317.2	24	FALSE	FALSE
2	2	320.2	24	FALSE	FALSE
2	3	277	24	FALSE	FALSE
2	…	278.1	24	FALSE	FALSE

In this example I explicitly set id="person" to let R know which variable to use to match the rows of data, although you don’t have to, and dplyr can normally guess. You do have to use the same variable name in both files though (so, person and participant couldn’t be matched, for example).

Other types of joins

As the dplyr manual states: left joins return for two dataframes, x and y will return all rows from x, and all columns from both x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Left joins are probably the most useful, but there are other types which can be useful. To see all of them look at the help files (help('join', 'dplyr')).

To show one common example, if we wanted to check whether we were missing RT data for one of our participants, we could use an anti_join:

anti_join(demographics, rts, by="person") %>%
  pander

person	age	left.handed	female
3	33	TRUE	FALSE

This table lists the data for all the people in the demographics file, for whom we don’t have RT data. This can be a useful way of checking you haven’t mislaid a raw datafile (e.g. forgot to copy it from the lab machine!).