Saving and exporting

Before you start saving data in CSV or any other format, ask yourself: “Do I need to save this dataset, or should I simply save my raw data and code?”

Often it’s best to keep only your raw data files, together with the R code you used to process them. This keeps your disk tidier and avoids confusion between multiple versions of files.

If it takes a long time to process your data, though, you might want to save interim steps. And if you share your data (which you should), you might also want to save simplified or anonymised versions of it in widely accessible formats.

Use CSV files

Comma-separated values (CSV) files are a plain-text format that is ideal for storing and sharing your data. They are:

  • Understood by almost every piece of software, ever
  • Likely to remain readable in future
  • Perfect for storing 2D data (like dataframes)
  • Readable by humans (just open them in Notepad)

Commercial formats like Excel, SPSS (.sav) and Stata (.dta) don’t have these properties.

Although CSV has some disadvantages, they are all easily overcome if you save the steps of your data processing and analysis in your R code (see below).

Saving a dataframe to .csv is as simple as:

readr::write_csv(mtcars, 'mtcars.csv')

If you run this within an RMarkdown document, this will create the new csv file in the same directory as your .Rmd file.

You can also use the write.csv() function in base R, but this version from readr is faster and has more sensible defaults (e.g. it doesn’t write row names, but does save column names in the first row).
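To see what the row-names default means in practice, here is a minimal sketch using only base R. The file names are arbitrary, and the files are written to a temporary directory so nothing is left in your project folder.

```r
# Write mtcars twice to a temporary directory using base R only
with_rownames    <- file.path(tempdir(), "mtcars_rownames.csv")
without_rownames <- file.path(tempdir(), "mtcars_plain.csv")

write.csv(mtcars, with_rownames)                        # default: row names are written as an extra column
write.csv(mtcars, without_rownames, row.names = FALSE)  # closer to what readr::write_csv() produces

# Reading the files back shows the difference in the number of columns
ncol(read.csv(with_rownames))     # one more column than mtcars
ncol(read.csv(without_rownames))  # the same number of columns as mtcars
```

If you do use write.csv(), setting row.names = FALSE is usually what you want unless the row names carry real information.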

Save processes, not just outcomes

Many students (and academics) make errors in their analyses because they process data by hand (e.g. editing files in Excel) or use GUI tools to run analyses.

In both cases these errors are hard to identify or rectify because only the outputs of the analysis can be saved, and no record has been made of how these outputs were produced.

In contrast, if you do your data processing and analysis in R/RMarkdown you benefit from a concrete, repeatable series of steps which can be checked and verified by others. This can also save lots of time if you need to process additional data later on (e.g. if you run more participants).
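As a rough illustration of what such a repeatable series of steps might look like, the sketch below reads a raw CSV file and applies every processing decision in code. The file name, column names, exclusion rule and the process_scores() function are all hypothetical, invented for this example.

```r
# Hypothetical raw data, written to a temporary file so the sketch is self-contained
raw_path <- file.path(tempdir(), "raw_scores.csv")
write.csv(
  data.frame(
    participant = c("p01", "p02", "p03", "p04"),
    condition   = c("control", "treatment", "control", "treatment"),
    score       = c(12, 18, NA, 21)
  ),
  raw_path,
  row.names = FALSE
)

# Every processing decision is recorded as code rather than made by hand
process_scores <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  # CSV stores no type information, so restore the factor explicitly
  raw$condition <- factor(raw$condition, levels = c("control", "treatment"))
  # Drop rows with a missing score (an exclusion rule that is now on record)
  raw[!is.na(raw$score), ]
}

processed <- process_scores(raw_path)
```

Because the raw file is never edited, rerunning process_scores() on a larger file (e.g. after testing more participants) repeats exactly the same steps.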

Some principles to follow when working:

  • Save your raw data in the simplest possible format, in CSV

  • Always include column names in the file

  • Use descriptive names, but with a regular structure.

  • Never include spaces or special characters in the column names. Use underscores (_) if you want to make things more readable.

  • Keep names under 20 characters in length if possible
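The naming principles above could be applied in code along these lines. This is only a sketch: tidy_names() is a hypothetical helper written for this example (it is not part of any package), and the messy column names are made up.

```r
# Hypothetical helper (not from any package) applying the naming rules above
tidy_names <- function(x, max_chars = 19) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)  # spaces and special characters become underscores
  x <- gsub("^_+|_+$", "", x)      # drop leading and trailing underscores
  substr(x, 1, max_chars)          # truncate names of 20 characters or more
}

messy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(messy) <- c("Reaction Time (ms)", "Participant ID", "age")

names(messy) <- tidy_names(names(messy))
names(messy)
```

Truncating long names can produce duplicates, so it is worth checking the result with anyDuplicated(names(messy)) before saving.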

Saving interim steps

If you are saving data to use again later in R, the best format is RDS. Saving files to RDS is covered in a later section.

If you are saving interim steps but think you might want to access them from other programs in future, use CSV instead.

To save something using RDS:

# create a large data frame of one million random numbers
# (tibble() replaces the deprecated data_frame())
massive.df <- tibble::tibble(nums = rnorm(1e6))
saveRDS(massive.df, file = "massive.RDS")

Then later on you can load it like this:

restored.massive.df <- readRDS("massive.RDS")

If you do this in RMarkdown, by default the RDS files will be saved in the same directory as your .Rmd file.
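One reason RDS suits data you will reuse in R is that it stores the R object itself, including column types. The sketch below, using base R only and a small made-up data frame, compares an RDS round trip with a CSV round trip. The file names are arbitrary and the files go to a temporary directory.

```r
# A small made-up data frame with a factor and a Date column
df <- data.frame(
  group  = factor(c("low", "high", "low"), levels = c("low", "high")),
  tested = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"))
)

rds_path <- file.path(tempdir(), "types.RDS")
csv_path <- file.path(tempdir(), "types.csv")

saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

identical(df, from_rds)  # the RDS copy should match the original exactly
class(from_csv$group)    # read back as plain text: the factor levels and their order are lost
class(from_csv$tested)   # also plain text rather than Date
```

With CSV you would need to restore these types in your code each time you load the file, which is another reason to keep that code alongside the data.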

Archiving, publication and sharing

If you want to share data with someone else, or open it in a different software package, using ‘.csv’ format is strongly recommended unless some other format is common in your field.

When archiving data, or sharing with others, you must document what each column measures, and any processing steps used to create the file. RMarkdown is a good way of doing this because it can combine the processing with narrative explaining what is being done, and why.
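One lightweight way to start documenting columns is to generate a skeleton "codebook" from the data and fill in the descriptions by hand. The sketch below is just one possible approach: make_codebook() is a hypothetical helper written for this example, and the output file name is arbitrary.

```r
# Hypothetical helper: one row per column, with a blank description to fill in
make_codebook <- function(df) {
  data.frame(
    column      = names(df),
    type        = vapply(df, function(col) class(col)[1], character(1)),
    n_missing   = vapply(df, function(col) sum(is.na(col)), integer(1)),
    description = "",
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

codebook <- make_codebook(mtcars)

# Save the codebook next to the data, in the same widely readable format
write.csv(codebook, file.path(tempdir(), "mtcars_codebook.csv"), row.names = FALSE)
```

The description column still has to be written by you; the code only guarantees that no column is forgotten.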