In brief
Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014b).
It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.
Before you start
So far you have used the rmd
files in a directory called lifesavr_X
where X is the current year. This was created by the bootstrap script you run in the first lifesavr session.
Now we need to create a new folder and set of Rmd files to record our work in the rest of the module.
So, before your start work today:
use the Files pane to make a new folder on the RStudio server to save your work. Call this
rmip_2021
(RMIP is short for Research Methods In Practice, the title of this module.Make sure this new folder is in your home directory (if you’re not sure, watch the video).
Inside this new folder, make a new RMarkdown file (use the ‘file’ menu and choose ‘new’). When you save the file make sure it has the extension
.rmd
, so call itdata-handling.rmd
for example.Use this new
.rmd
file to save your work during this session.
NOTE: this video was made for another module called ‘datafluency’. You should make a folder called
rmip
not `datafluency
Cleaning questionnaire data
More than 80% of the time spent doing data analysis can be taken up getting your data into the right format. There are lots of tricks and techniques you can learn to become fluent in data handling.
In this module we’ll cover just a few—enough to work with our data, when you export it from Office 365.
See this article for a gentle introduction to some good habits https://www.theanalysisfactor.com/preparing-data-analysis/, and also Wickham (2014a)
You might also like to remind yourself of the stage 1 notes on entering data by hand, and uploading it to RStudio, which are here
Loading Excel data
You should always keep your data in csv format. Don’t use Excel or other proprietary formats to store data unless absolutely necessary.
Unfortunately, some tools (including Office 365 Forms) provide data in xlsx
format, and for this we need to use a special function in R. You can import this data to RStudio by:
- Uploading the
xlsx
file to the RStudio server, as you have done before - Loading the
rio
package - Using the
import()
function in that package, instead ofread_csv()
There is an example dataset linked below which we will use for this session. These data were exported from an Office 365 Form, so they are in a similar format to that which your questionnaire data will be:
We can look at the first few rows of the data, using the head
command:
%>% head() sweets
ID Start time Completion time Email Name How much do you like sweets? How much do you like chocolate
1 6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I don't like them I don't like them
2 7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral I don't like them
3 8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I like them I'm neutral
4 9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral I'm neutral
Gender
1 M
2 F
3 M
4 F
Check the ‘before you start’ instructions at the top of this page
Import the sweets data into RStudio, as shown above
Save it to a new variable called
sweets
Tidying up
For the moment we are going to use the sweets
data from above to illustrate how to tidy up imported data.
When we look at the imported data it’s useful to note:
There are extra columns we don’t need (at least for now).
Some of our variable names are very long and annoying to type (for example
How much do you like sweets?
is the name of one of our columns).Our responses are in text format, rather than as numbers. For example, the data say
"I don't like them"
or"I'm neutral"
rather than numbers from a 1-5 scale.We may have other data (e.g. MCQ test scores for the outcome) that are stored in a separate file.
We need to sort each of these problems to make things more manageable for our analysis.
Selecting and using variables
In the sweets
you imported you will notice that you have:
- variables you don’t need (we want to drop these)
- column names which have long/complex names, which include spaces
R makes using columns with spaces or other special characters very hard.
To use columns with spaces in we must ‘escape’ the spaces and let R know they are part of the name rather than a gap between two different names.
This video shows how (or read below):
To escape spaces and use columns with long names we use the backtick character (the backwards facing apostrophe) to wrap the column name.
In general, if your columns contain spaces or other odd characters like hyphens or question marks then you will need to wrap them in backticks.
Autocomplete
To avoid having to type backticks use RStudio’s autocomplete functionality.
In this video you can see me using autocomplete which wraps the long variable name in backticks automatically.
- Try using autocomplete to select one of the variables in the
sweets
dataset. - Remember to use the shortcut
cmd+shift+m
(mac) orctrl+shift+m
(pc) to enter the pipe symbol quickly using the keyboard.
Selecting columns
If you have a very large dataset it sometimes helps to simplify things. Also, you might need to remove columns to anonymise your data to share with others.
You can create a new dataframe, selecting only columns you need, by using the select()
function. For example:
sweets %>%
sweets_subset <- select(ID, Gender, `How much do you like sweets?`)
Explanation: We have created a new variable called sweets_subset
which contains 3 columns instead of the original 8. We can see this by using glimpse:
%>% glimpse sweets
Rows: 4
Columns: 8
$ ID <dbl> 6, 7, 8, 9
$ `Start time` <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 0…
$ `Completion time` <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 0…
$ Email <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name <lgl> NA, NA, NA, NA
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"
$ Gender <chr> "M", "F", "M", "F"
Compared with:
%>% glimpse sweets_subset
Rows: 4
Columns: 3
$ ID <dbl> 6, 7, 8, 9
$ Gender <chr> "M", "F", "M", "F"
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
Create a new variable using (<-
and select
) called sweets_subset
. This should contain only 2 columns from the original imported data.
Renaming columns
Some of the imported variable names in the sweets
data are long and awkward to use.
Most researchers would rename these variables, to make them more usable in R code.
You can rename variables like this:
%>%
datasetname rename(NEW_COLUMN_NAME = OLD_COLUMN_NAME)
So for this example:
%>%
sweets rename(
like_sweets = `How much do you like sweets?`,
like_chocolate = `How much do you like chocolate`,
)
ID Start time Completion time Email Name like_sweets like_chocolate Gender
1 6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I don't like them I don't like them M
2 7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral I don't like them F
3 8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I like them I'm neutral M
4 9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral I'm neutral F
Explanation of the code: We used rename
to change the names of our variables. We needed to wrap the long names of the questions in ‘backtick’ symbols to make sure R understood it was a single column name.
You should create a new variable to save the renamed dataset (with a descriptive name for use later on):
sweets %>%
sweets.renamed <- rename(
like_sweets = `How much do you like sweets?`,
like_chocolate = `How much do you like chocolate`,
)
Create a copy of the sweets data in which you have renamed the two long column names.
Recoding text data
We noticed above that our responses were stored as text labels like "I don't like them"
rather than on a numeric scale. This makes it hard to use in an analysis.
We need to recode the text variables into numeric versions.
How to do it
First we must tell R what number we want to use for each text label. That is, we create a mapping of numbers to labels.
This takes a few steps:
- Check exactly what the text values are which need to be mapped.
- Make a mapping variable which assigns each text value a number value
- Use the
recode
function withmutate
to create a new, numeric column
This video walks you through the steps below:
Step 1: Check EXACTLY what text labels we have
To check which labels we need to recode, I select the column in question and use the unique()
function.
%>%
sweets select(`How much do you like sweets?`) %>%
unique()
How much do you like sweets?
1 I don't like them
2 I'm neutral
3 I like them
Do the same to find out the possible values of the How much do you like chocolate
column.
%>%
sweets select(`How much do you like chocolate`) %>%
distinct()
How much do you like chocolate
1 I don't like them
2 I'm neutral
Step 2: Make a mapping variable
We do this by creating what R calles a named vector, which is a special kind of list.
To make a named vector we use the the c()
function. The letter c
here just stands for ‘combine’ — i.e. ‘combine these things into a list’.
This is a simple example:
c("No" = 0, "Yes" = 1) mapping.list <-
We could then use this mapping to recode a column of data which contained the words “No” or "Yes
A useful trick when creating your own mappings is to use R to do the formatting for you (see the video above for a demo).
Re-using the code from the previous step, we use unique()
to show us the unique values for the quuestion about sweets.
We then pipe the result to the paste()
and cat()
functions, like this:
%>%
sweets select(`How much do you like sweets?`) %>%
unique() %>%
paste() %>% cat()
c("I don't like them", "I'm neutral", "I like them")
Explanation of the output: Using paste
and cat
is a bit of a hack. When we run this code we see the output c("I don't like them", "I'm neutral", "I like them")
. This is a list of the values in the sweets
data for this question, formatted in a way that will be useful to us in the next step.
We then copy and paste this output into a NEW code block, and edit it to assign our mappings:
c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1) likert.mappings <-
Explanation of the code: We used the previous output to create a mapping. By adding the = -1
and = 0
etc, we have told R what value we want to assign for each label.
Q: How do you know what number values to assign?
A: It doesn’t matter, provided:
- The intervals between each options are the same and
- Each text value has a different number
So, if we had a Likert-scale ranging from “Completely agree” to “Completely disagree” in 7 increments, we could score this from 0 - 6
or 1 - 7
, or -3 - 3
. These would all be fine.
Step 3: Use the mapping variable to recode the column
We can use our new mapping with the mutate
and recode
functions to make a new column, containing numbers rather than text:
sweets %>%
sweets.recoded <- mutate(
like_sweets_numeric =
recode(`How much do you like sweets?`, !!!likert.mappings)
)
We can see this new column if we use glimpse
:
%>% glimpse() sweets.recoded
Rows: 4
Columns: 9
$ ID <dbl> 6, 7, 8, 9
$ `Start time` <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 0…
$ `Completion time` <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 0…
$ Email <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name <lgl> NA, NA, NA, NA
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"
$ Gender <chr> "M", "F", "M", "F"
$ like_sweets_numeric <dbl> -1, 0, 1, 0
Explanation of the code: The start of the first line is sweets.recoded <-
which means make a new variable called sweets.recoded
. Then we use mutate
to create a new column called like_sweets_numeric
. We make this column using recode
on the question about liking sweets. We use the likert.mappings
mapping to specify what numeric score to give each of the text values.
Watch out for the exclamations marks!!!
: In the code there are three exclamation marks, !!!
, before the mapping; make sure you do the same.
Summary/video explanation
This is probably the trickiest part of all the R code we use. I’ve included an annotated video of just these steps as a reference below:
Use this three-step process to create a recoded version of the like_chocolate
variable.
Remember to watch the video at the start of this section, or the short version in the green box above, if anything is unclear.
If you have any questions or get stuck, please ask questions in the workshop or on Discourse
Reverse codings
Watch out for reverse ‘coded items’. Imagine your questionnaire includes two items:
- I really love sweets
- I really hate sweets
If this were the case then using the same mapping and adding responses to these questions would not be appropriate.
Instead, we should use two separate mappings for the different questions:
c(
agree.disagree.responses <-"Agree" = 3,
"Neutral" = 2,
"Disagree" = 1)
c(
agree.disagree.responses.REVERSED <-"Agree" = 1,
"Neutral" = 2,
"Disagree" = 3)
Coded in this way, scores for the questions could be added or averaged to provide a summary score.
Multiple new columns
We can create multiple new columns at once using mutate, to recode each of our variables:
sweets.renamed %>%
sweets.recoded <- mutate(
like_sweets_numeric = recode(like_sweets, !!!likert.mappings),
like_chocolate_numeric = recode(like_chocolate, !!!likert.mappings)
)
Explanation of the code: We combined the two examples above to create 3 new columns with recoded (numeric) values. We saved this to the variable called sweets.recoded
.
Combining scores
We designed our questionnaires to make repeated measurements of the same phenomena.
That means we don’t want to analyse the questions individually: we’d like to sum or average the scores per-person.
There are various ways of describing this: summary scores, aggregate scores, scale score, total score all likely refer to the same thing: a single new variable that we have created by adding up or averaging scores on a number of variables.
Q: Does it matter if you average* or take the sum of the questionnaire items?*
A: No! It might seem surprising, but it doesn’t matter. If we use a summary score in a regression then the coefficients will be a bit different, because the scale of our predictor variable has changed. But the strength of the relationship will be the same, and the tests statistics we use will be identical.
We already saw how to use mutate
to create a new column in the section on recoding. We can use this again to create the sum of both questions:
%>%
sweets.recoded mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
select(starts_with('like'))
like_sweets like_chocolate like_sweets_numeric like_chocolate_numeric
1 I don't like them I don't like them -1 -1
2 I'm neutral I don't like them 0 -1
3 I like them I'm neutral 1 0
4 I'm neutral I'm neutral 0 0
Explanation of the code: We added both the questions which asked about ‘liking’ together. This created a new column containing the combined score, called liking
. In the third line we selected only columns with names that started with the word ‘like’, to make the output easier to inspect
We should probably save the result using a new variable name though, so we can use it later:
sweets.recoded %>%
sweets.recoded.withsummary <- mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
select(ID, matches('like'), liking)
Consolidation
Consolidate all the steps shown above to do the following in a single code block:
- Read in the sweets xlsx file
- Rename the long column names to something short, and without spaces
- Recode the ‘like sweets’ and ‘like chocolate’ columns to be numeric
- Create a summary score called
sweet_tooth
which is the sum of the sweets and chocolate questions - Save the result of this work in a new variable called
sweets.tidy
import('sweets.xlsx')
sweets <-
# used to check what response values are in each question
%>%
sweets select(`How much do you like sweets?`) %>%
unique() %>% paste %>% cat
c("I don't like them", "I'm neutral", "I like them")
%>%
sweets select(`How much do you like chocolate`) %>%
unique() %>% paste %>% cat
c("I don't like them", "I'm neutral")
# we will use the same mapping for both questions because they have the same responses
c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1)
sweets.map <-
sweets %>%
sweets.tidy <- # now we recode the two text variables (we only need use mutate once though)
mutate(
like_sweets = recode(`How much do you like sweets?`, !!!sweets.map),
like_choc = recode(`How much do you like chocolate`, !!!sweets.map)
%>%
) # and create the summary score
mutate(sweet_tooth = like_sweets + like_choc ) %>%
select(ID, like_sweets, like_choc, sweet_tooth)
# finally, check the result. it looks ok
%>% glimpse sweets.tidy
Rows: 4
Columns: 4
$ ID <dbl> 6, 7, 8, 9
$ like_sweets <dbl> -1, 0, 1, 0
$ like_choc <dbl> -1, -1, 0, 0
$ sweet_tooth <dbl> -2, -1, 1, 0
Homework
Make sure you are up to date with all of the
lifesavR
worksheets, and this data handling worksheet.Keep a copy of all your code in an Rmd file, and spend 5 minutes tidying this up and adding comments to explain what the different commands do.
If you are an MSc student (or a BSc student and feel like you could do with some revision) work through the section of this worksheet which explains standard deviations and the later section on ‘effect sizes’. You don’t need to complete the whole worksheet (most of the content should be familiar) but the section on effect sizes will be important in our final session on reporting.
TARA Led session
MCQ and Survey completion
If you have not already done so, complete the MCQ now at this link:
Everyone also now needs to complete the 2 selected surveys linked from this discourse post
A real example
The sweets data in the previous session were a very simple example.
We can practice these skills using a dataset generated by students last year. This group’s questionnaire measured sleep quality, and the data are available here:
This data file has been slightly cleaned up, but is similar to the format that your O365 output will take. The only changes I made to these data were to:
- Delete identifying information (names and emails)
- Remove one or two extra columns
Tidy up and recode the sleep dataset:
- Import the
sleep.xlsx
document (download xfun::embed_file(c(‘data/student-examples/sleep.xlsx’), text=“here”)) - Use
rename
to change the variable names to be short and easy to use. - Use
mutate
withrecode
to change at least 3 of the questionnaire items to numeric scores (watch out for reverse coding).
When you have done this,
Use
mutate
to combine the recoded questions into a single summary score. This can be either to sum or the average of the 6 questions you recoded.Create a plot of this summary score, showing the distribution of values (e.g. a histogram).
Save a cleaned-up copy of the data using
write_csv
.
Extension exercises
‘Matching’ columns by names
This is a slightly more advanced technique. It can sometimes save time, but it’s not required to pass the course so skip this if you are short on time or finding R tricky.
Another good way to avoid backticks and long column names is to use the matches()
function. Especially so if you have several questions which start in the same way and have a consistent root.
The example below selects all columns where the name includes the text "How much do you"
, as well as the ID
and Gender
columns:
sweets %>%
sweets_how_much <- select(ID, Gender, matches("How much"))
%>% glimpse sweets_how_much
Rows: 4
Columns: 4
$ ID <dbl> 6, 7, 8, 9
$ Gender <chr> "M", "F", "M", "F"
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"
Explanation of the code: We used the select
command to choose only the columns we needed. We used the matches()
function to avoid typing each of the long question column-names by hand. When we say matches("How much")
, this selects all the columns which start with the text "How much"
. We save this subset of columns with a new variable name, sweets_how_much
Other useful functions for this purpose are:
starts_with("str")
ends_with("str")
contains("str")
num_range("Q", 1:10)
(would match variables called"Q1"
or"Q01"
,"Q2"
, through to"Q10"
)
See https://tidyselect.r-lib.org/reference/select_helpers.html for more information.
- Try selecting the two columns called
Start time
andCompletion time
using thematches
function.
%>%
sweets select(matches("time"))
Start time Completion time
1 2019-05-24 09:12:31 2019-05-24 09:12:35
2 2019-05-24 09:12:31 2019-05-24 09:12:35
3 2019-05-24 09:12:31 2019-05-24 09:12:35
4 2019-05-24 09:12:31 2019-05-24 09:12:35
Assessment progress
This session introduces skills required for questions 3, 4 and 5 of the data assignment.
References
Wickham, Hadley. 2014a. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.
———. 2014b. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.