Dealing with multiple files

Often you will have multiple data files, for example those produced by experimental software.

This is one of the few times when you might have to do something resembling ‘real programming’, but it’s still fairly straightforward.

In the repeated measures Anova example later on in this guide we encounter some data from an experiment where reaction times were recorded in 25 trials (Trial) before and after (Time) one of 4 experimental manipulations (Condition = {1,2,3,4}). There were 48 participants in total.

Let’s say that we have saved all the files in a single directory, and that they are numbered sequentially: person1.csv, person2.csv and so on.

Using the list.files() function we can list the contents of a directory on the hard drive:

list.files('data/multiple-file-example/')
 [1] "person1.csv"  "person10.csv" "person11.csv" "person12.csv"
 [5] "person13.csv" "person14.csv" "person15.csv" "person16.csv"
 [9] "person17.csv" "person18.csv" "person19.csv" "person2.csv" 
[13] "person20.csv" "person21.csv" "person22.csv" "person23.csv"
[17] "person24.csv" "person25.csv" "person26.csv" "person27.csv"
[21] "person28.csv" "person29.csv" "person3.csv"  "person30.csv"
[25] "person31.csv" "person32.csv" "person33.csv" "person34.csv"
[29] "person35.csv" "person36.csv" "person37.csv" "person38.csv"
[33] "person39.csv" "person4.csv"  "person40.csv" "person41.csv"
[37] "person42.csv" "person43.csv" "person44.csv" "person45.csv"
[41] "person46.csv" "person47.csv" "person48.csv" "person5.csv" 
[45] "person6.csv"  "person7.csv"  "person8.csv"  "person9.csv" 

The list.files() function creates a vector of the names of all the files in the directory.
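
If the directory might also hold files that are not data files, list.files() can filter by name. The sketch below is only an illustration: it uses a temporary directory and made-up file names rather than the real data, and relies on the pattern argument, which is a regular expression matched against each file name.

```r
# Create a throwaway directory with a mix of files (names are illustrative only)
example.dir <- file.path(tempdir(), "list-files-example")
dir.create(example.dir, showWarnings = FALSE)
file.create(file.path(example.dir, c("person1.csv", "person2.csv", "notes.txt")))

# Keep only the names ending in .csv
csv.files <- list.files(example.dir, pattern = "\\.csv$")
csv.files
```

Note that, as in the listing above, the names come back in alphabetical rather than numerical order (person10.csv appears before person2.csv). This doesn't matter here because every file is imported and combined anyway.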

At this point, there are many, many ways of importing the contents of these files, but below we use a technique which is concise, reliable, and less error-prone than many others. It also continues to use the dplyr library.

This approach has 3 steps:

  1. Put all the names of the .csv files into a dataframe.
  2. For each row in the dataframe, run a function which imports the file as a dataframe.
  3. Combine all these dataframes together.

Putting the filenames into a dataframe

Because list.files() produces a vector, we can make it a column in a new dataframe:

# tibble() replaces data_frame(), which is deprecated in recent versions of tibble and dplyr
raw.files <- tibble(filename = list.files('data/multiple-file-example/'))

And we can make a new column with the complete path (i.e. including the directory holding the files), using the paste0() function, which joins strings of text together. We wouldn’t have to do this if the raw files were in the same directory as our RMarkdown file, but that would get messy.

raw.file.paths <- raw.files  %>%
  mutate(filepath = paste0("data/multiple-file-example/", filename))

raw.file.paths %>%
  head(3)
# A tibble: 3 x 2
  filename     filepath                               
  <chr>        <chr>                                  
1 person1.csv  data/multiple-file-example/person1.csv 
2 person10.csv data/multiple-file-example/person10.csv
3 person11.csv data/multiple-file-example/person11.csv
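
As an aside, base R offers two other ways to build these paths, sketched here under the assumption that you are happy to swap them in for paste0(): file.path() joins path components for you, and the full.names argument of list.files() asks for paths that already include the directory. The temporary directory and file name below are made up for illustration.

```r
# file.path() inserts the "/" separator, so no trailing slash is needed
p1 <- paste0("data/multiple-file-example/", "person1.csv")
p2 <- file.path("data/multiple-file-example", "person1.csv")
identical(p1, p2)

# full.names = TRUE returns paths that can be passed straight to an import function
example.dir <- file.path(tempdir(), "file-path-example")
dir.create(example.dir, showWarnings = FALSE)
file.create(file.path(example.dir, "person1.csv"))
full.paths <- list.files(example.dir, full.names = TRUE)
file.exists(full.paths)
```

Keeping the bare filename in its own column, as we do here, is still handy if you later want to pull the participant number out of it.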

Using do()

We can then use the do() function in dplyr:: to import the data for each file and combine the results in a single dataframe.

The do() function allows us to run any R function for each group or row in a dataframe.

This means that our original dataframe is broken up into chunks (either groups of rows, if we use group_by(), or individual rows if we use rowwise()) and each chunk is fed to the function we specify. This function must do its work and return a new dataframe, and these are then combined into a single larger dataframe.
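
To see this chunking in isolation, here is a minimal sketch using the built-in mtcars data rather than our reaction-time files (the choice of dataset and of head() as the function is purely illustrative). Recent versions of dplyr mark do() as superseded, but it still works as described.

```r
library(dplyr)

# Split mtcars into one chunk per value of cyl, keep the first two rows
# of each chunk, and let do() stack the results into a single dataframe
first.two.per.group <- mtcars %>%
  group_by(cyl) %>%
  do(head(., 2))

nrow(first.two.per.group)
```

Inside do(), the dot stands for the current chunk, so head(., 2) is run once for each group of cars.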

So in this example, we break our dataframe of filenames up into individual rows using rowwise(), and then specify the read_csv() function, which takes the name of a csv file and returns its content as a dataframe (see the importing data section).

For example:

raw.data <- raw.file.paths %>%
  # 'do' the function for each row in turn
  rowwise() %>%
  do(., read_csv(file=.$filepath))

We can check these data look OK by sampling 10 rows at random:

raw.data %>%
  sample_n(10) %>%
  pander()
Condition  trial  time  person  RT
2          20     1     24      286.3
2          3      2     14      301.8
3          22     2     35      211.2
1          16     2     4       215.4
2          24     2     20      383.8
4          19     2     47      172.5
4          15     1     37      188.7
2          12     2     23      176.6
3          11     2     26      242.5
3          19     1     28      246.8
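
For comparison, the same three steps (collect the names, import each file, combine the results) can also be sketched without do(), using lapply() and dplyr's bind_rows(). This is an alternative, not the approach used in this guide, and the two tiny CSV files it writes to a temporary directory are made up so that the example runs on its own.

```r
library(dplyr)
library(readr)

# Write two small made-up CSV files to a temporary directory
example.dir <- file.path(tempdir(), "bind-rows-example")
dir.create(example.dir, showWarnings = FALSE)
write_csv(tibble(person = 1, trial = 1:3, RT = c(250, 260, 240)),
          file.path(example.dir, "person1.csv"))
write_csv(tibble(person = 2, trial = 1:3, RT = c(300, 310, 290)),
          file.path(example.dir, "person2.csv"))

# Step 1: collect the file paths
paths <- list.files(example.dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: read each file into its own dataframe (giving a list of dataframes)
each.file <- lapply(paths, read_csv)

# Step 3: stack them into one dataframe
combined <- bind_rows(each.file)
```

Either route ends with one long dataframe containing a row for every trial from every participant.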

Using custom functions with do()

In this example, each of the raw data files included the participant number (the person variable). However, this isn’t always the case.

This isn’t a problem though, if we create our own helper function to import the data. Writing small functions in R is very easy, and the example below wraps the read_csv() function and adds a new column, filepath, to the imported dataframe. This enables us to keep track of where each row in the final combined dataset came from.

This is the helper function:

read.csv.and.add.filename <- function(filepath){
  read_csv(filepath) %>%
    mutate(filepath=filepath)
}

In English, you should read this as:

“Create a new R function called read.csv.and.add.filename which expects to be passed a path to a csv file as an input. This function reads the csv file at the path (converting it to a dataframe), and adds a new column containing the original file path it read from. It then returns this dataframe.”
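
Before using a helper like this on all the files, it can be worth trying it on just one. The sketch below repeats the helper so that it runs on its own, and calls it on a small made-up file (person99.csv, written to a temporary location; it is not one of the real data files).

```r
library(dplyr)
library(readr)

# The helper from above, repeated so this sketch is self-contained
read.csv.and.add.filename <- function(filepath){
  read_csv(filepath) %>%
    mutate(filepath = filepath)
}

# A small made-up file in a temporary location
example.file <- file.path(tempdir(), "person99.csv")
write_csv(tibble(trial = 1:2, RT = c(250, 260)), example.file)

# Import it and check that the extra column has been added
one.file <- read.csv.and.add.filename(example.file)
names(one.file)
```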

We can use our helper function with do() in place of the bare read_csv function we used before:

raw.data.with.paths <- raw.file.paths %>%
  rowwise() %>%
  do(., read.csv.and.add.filename(.$filepath))

raw.data.with.paths %>%
  sample_n(10) %>%
  pander()
Condition  trial  time  person  RT     filepath
1          21     2     6       180.8  data/multiple-file-example/person6.csv
2          24     1     22      229    data/multiple-file-example/person22.csv
4          22     1     40      152    data/multiple-file-example/person40.csv
2          3      2     21      173.9  data/multiple-file-example/person21.csv
2          5      2     13      218.7  data/multiple-file-example/person13.csv
4          18     1     42      236.6  data/multiple-file-example/person42.csv
3          15     2     33      122.9  data/multiple-file-example/person33.csv
4          19     2     37      134.9  data/multiple-file-example/person37.csv
3          3      2     32      334.6  data/multiple-file-example/person32.csv
1          8      2     4       435.3  data/multiple-file-example/person4.csv

At this point you might need to use the extract() or separate() functions from tidyr to post-process the file path and re-create the person variable from it (although in this case that’s already been done for us).
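
As a minimal sketch of that post-processing step, assuming file paths of the form shown above, tidyr's extract() can pull the digits out of the path with a regular expression. The two-row dataframe here is typed in by hand for illustration, and the regular expression is an assumption about how the files are named.

```r
library(dplyr)
library(tidyr)

# A hand-made stand-in for the combined data, with just two rows
paths <- tibble(
  filepath = c("data/multiple-file-example/person6.csv",
               "data/multiple-file-example/person22.csv"),
  RT = c(180.8, 229)
)

# Capture the digits between "person" and ".csv" into a new column;
# convert = TRUE turns the captured text into a number, and
# remove = FALSE keeps the original filepath column
with.person <- paths %>%
  tidyr::extract(filepath, into = "person",
                 regex = "person([0-9]+)\\.csv$",
                 remove = FALSE, convert = TRUE)

with.person
```

The function is called as tidyr::extract() here because other packages (magrittr, for instance) also export a function named extract().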