Dealing with multiple files
Often you will have multiple data files, for example those produced by experimental software.
This is one of the few times when you might have to do something resembling ‘real programming’, but it’s still fairly straightforward.
In the repeated measures Anova example later on in this guide we encounter some data from an experiment where reaction times were recorded in 25 trials (Trial) before and after (Time) one of 4 experimental manipulations (Condition = {1,2,3,4}). There were 48 participants in total.
Let’s say that we have saved all the files in a single directory, and that they are numbered sequentially: person1.csv, person2.csv and so on.
Using the list.files() function we can list the contents of a directory on the hard drive:
list.files('data/multiple-file-example/')
[1] "person1.csv" "person10.csv" "person11.csv" "person12.csv"
[5] "person13.csv" "person14.csv" "person15.csv" "person16.csv"
[9] "person17.csv" "person18.csv" "person19.csv" "person2.csv"
[13] "person20.csv" "person21.csv" "person22.csv" "person23.csv"
[17] "person24.csv" "person25.csv" "person26.csv" "person27.csv"
[21] "person28.csv" "person29.csv" "person3.csv" "person30.csv"
[25] "person31.csv" "person32.csv" "person33.csv" "person34.csv"
[29] "person35.csv" "person36.csv" "person37.csv" "person38.csv"
[33] "person39.csv" "person4.csv" "person40.csv" "person41.csv"
[37] "person42.csv" "person43.csv" "person44.csv" "person45.csv"
[41] "person46.csv" "person47.csv" "person48.csv" "person5.csv"
[45] "person6.csv" "person7.csv" "person8.csv" "person9.csv"
The list.files() function creates a vector of the names of all the files in the directory.
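As a small sketch of two optional arguments to list.files(): pattern restricts the listing to names matching a regular expression, and full.names = TRUE returns each name with the directory prepended. The temporary directory and the file names below are made up purely for illustration.

```r
# Hypothetical demo directory, created just for this example
demo.dir <- file.path(tempdir(), "list-files-demo")
dir.create(demo.dir, showWarnings = FALSE)
file.create(file.path(demo.dir, c("person1.csv", "person2.csv", "person10.csv", "notes.txt")))

# Only names ending in .csv; notes.txt is skipped
csv.names <- list.files(demo.dir, pattern = "\\.csv$")

# The same files, with the directory included in each name
csv.paths <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)
```

Filtering on the extension is a cheap safeguard if other files (notes, backups) might end up in the same directory as the data.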
At this point, there are many, many ways of importing the contents of these files, but below we use a technique which is concise, reliable, and less error-prone than many others. It also continues to use the dplyr library.
This approach has 3 steps:
- Put all the names of the .csv files into a dataframe.
- For each row in the dataframe, run a function which imports the file as a dataframe.
- Combine all these dataframes together.
Putting the filenames into a dataframe
Because list.files() produces a vector, we can make it a column in a new dataframe:
raw.files <- tibble(filename = list.files('data/multiple-file-example/'))
And we can make a new column with the complete path (i.e. including the directory holding the files), using the paste0() function, which combines strings of text. We wouldn’t have to do this if the raw files were in the same directory as our RMarkdown file, but that would get messy.
raw.file.paths <- raw.files %>%
  mutate(filepath = paste0("data/multiple-file-example/", filename))
raw.file.paths %>%
  head(3)
# A tibble: 3 x 2
filename filepath
<chr> <chr>
1 person1.csv data/multiple-file-example/person1.csv
2 person10.csv data/multiple-file-example/person10.csv
3 person11.csv data/multiple-file-example/person11.csv
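As an aside, base R's file.path() builds the same paths without needing the trailing slash on the directory name. This is only a sketch of an alternative to paste0(); the variable names data.dir, filenames and filepaths are made up for the example.

```r
# Same directory as above, written without the trailing slash
data.dir <- "data/multiple-file-example"
filenames <- c("person1.csv", "person10.csv")

# file.path() inserts the "/" separator between its arguments
filepaths <- file.path(data.dir, filenames)
filepaths
```

Either approach gives identical strings here, so the choice is mostly a matter of taste.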
Using do()
We can then use the do() function in dplyr to import the data for each file and combine the results in a single dataframe.
The do() function allows us to run any R function for each group or row in a dataframe.
This means that our original dataframe is broken up into chunks (either groups of rows, if we use group_by(), or individual rows if we use rowwise()) and each chunk is fed to the function we specify. This function must do its work and return a new dataframe, and these are then combined into a single larger dataframe.
So in this example, we break our dataframe of filenames up into individual rows using rowwise() and then specify the read_csv() function, which takes the name of a csv file and returns the content as a dataframe (see the importing data section).
For example:
raw.data <- raw.file.paths %>%
  # 'do' the function for each row in turn
  rowwise() %>%
  do(., read_csv(file=.$filepath))
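For readers who want to try the three steps without the guide's data files, here is a self-contained sketch. It is not the do() call above: because do() is marked as superseded in dplyr 1.0 and later, this version swaps in lapply() to import each file and bind_rows() to combine them. The demo files, their contents and all demo.* names are invented for the example.

```r
library(dplyr)
library(readr)

# Hypothetical demo files, written to a temporary directory
demo.dir <- file.path(tempdir(), "multiple-file-demo")
dir.create(demo.dir, showWarnings = FALSE)
for (p in 1:3) {
  write_csv(
    tibble(person = p, trial = 1:2, RT = c(200, 250) + p),
    file.path(demo.dir, paste0("person", p, ".csv"))
  )
}

# Step 1: collect the file paths
demo.paths <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: import each file as its own dataframe, giving a list of dataframes
demo.list <- lapply(demo.paths, read_csv)

# Step 3: stack the list into a single dataframe
demo.data <- bind_rows(demo.list)
```

The shape of the result should match what do() produces: one row per trial, with the rows from every file stacked on top of each other.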
We can check these data look OK by sampling 10 rows at random:
raw.data %>%
  sample_n(10) %>%
  pander()
Condition | trial | time | person | RT |
---|---|---|---|---|
2 | 20 | 1 | 24 | 286.3 |
2 | 3 | 2 | 14 | 301.8 |
3 | 22 | 2 | 35 | 211.2 |
1 | 16 | 2 | 4 | 215.4 |
2 | 24 | 2 | 20 | 383.8 |
4 | 19 | 2 | 47 | 172.5 |
4 | 15 | 1 | 37 | 188.7 |
2 | 12 | 2 | 23 | 176.6 |
3 | 11 | 2 | 26 | 242.5 |
3 | 19 | 1 | 28 | 246.8 |
Using custom functions with do()
In this example, each of the raw data files included the participant number (the person variable). However, this isn’t always the case.
This isn’t a problem though, if we create our own helper function to import the data. Writing small functions in R is very easy, and the example below wraps the read_csv() function and adds a new column, filepath, to the imported dataframe, which enables us to keep track of where each row in the final combined dataset came from.
This is the helper function:
read.csv.and.add.filename <- function(filepath){
  read_csv(filepath) %>%
    mutate(filepath=filepath)
}
In English, you should read this as:
“Create a new R function called read.csv.and.add.filename which expects to be passed a path to a csv file as an input. This function reads the csv file at the path (converting it to a dataframe), and adds a new column containing the original file path it read from. It then returns this dataframe.”
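To see what the helper returns, it can be tried on a single file first. The sketch below repeats the function so that it runs on its own, and uses a made-up one-participant file (person99.csv, written to a temporary directory) rather than the guide's data.

```r
library(dplyr)
library(readr)

# The helper from above, repeated so this example is self-contained
read.csv.and.add.filename <- function(filepath){
  read_csv(filepath) %>%
    mutate(filepath = filepath)
}

# Hypothetical data file for one participant
demo.file <- file.path(tempdir(), "person99.csv")
write_csv(tibble(trial = 1:3, RT = c(210.5, 198.2, 305.0)), demo.file)

# Import it: the original columns come back, plus a filepath column
one.person <- read.csv.and.add.filename(demo.file)
names(one.person)
```

Every row of one.person carries the same filepath value, which is what lets rows be traced back to their source once many files are combined.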
We can use our helper function with do() in place of the bare read_csv() function we used before:
raw.data.with.paths <- raw.file.paths %>%
  rowwise() %>%
  do(., read.csv.and.add.filename(.$filepath))
raw.data.with.paths %>%
  sample_n(10) %>%
  pander()
Condition | trial | time | person | RT | filepath |
---|---|---|---|---|---|
1 | 21 | 2 | 6 | 180.8 | data/multiple-file-example/person6.csv |
2 | 24 | 1 | 22 | 229 | data/multiple-file-example/person22.csv |
4 | 22 | 1 | 40 | 152 | data/multiple-file-example/person40.csv |
2 | 3 | 2 | 21 | 173.9 | data/multiple-file-example/person21.csv |
2 | 5 | 2 | 13 | 218.7 | data/multiple-file-example/person13.csv |
4 | 18 | 1 | 42 | 236.6 | data/multiple-file-example/person42.csv |
3 | 15 | 2 | 33 | 122.9 | data/multiple-file-example/person33.csv |
4 | 19 | 2 | 37 | 134.9 | data/multiple-file-example/person37.csv |
3 | 3 | 2 | 32 | 334.6 | data/multiple-file-example/person32.csv |
1 | 8 | 2 | 4 | 435.3 | data/multiple-file-example/person4.csv |
At this point you might need to use the extract() or separate() functions to post-process the filename and re-create the person variable from this (although in this case that’s already been done for us).
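A minimal sketch of that post-processing step, assuming the file paths follow the personN.csv pattern seen above. It uses extract() from the tidyr package with a regular expression that captures the digits between "person" and ".csv". The demo dataframe is invented for the example and stands in for data that lacks a person column.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows shaped like raw.data.with.paths, minus the person column
demo <- tibble(
  RT = c(180.8, 229.0, 152.0),
  filepath = c("data/multiple-file-example/person6.csv",
               "data/multiple-file-example/person22.csv",
               "data/multiple-file-example/person40.csv")
)

# Capture the digits in the filename as a new person column.
# remove = FALSE keeps the filepath column; convert = TRUE turns "6" into a number.
demo.with.person <- demo %>%
  tidyr::extract(filepath, into = "person", regex = "person([0-9]+)\\.csv$",
                 remove = FALSE, convert = TRUE)
```

The function is called as tidyr::extract() because other packages (magrittr, for one) also export a function named extract, and which one wins depends on the order in which packages were loaded.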