In brief

Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014 b).

It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.

Before you start

So far you have used the rmd files in a directory called lifesavr_X where X is the current year. This was created by the bootstrap script you run in the first lifesavr session.

Now we need to create a new folder and set of Rmd files to record our work in the rest of the module.

So, before your start work today:

use the Files pane to make a new folder on the RStudio server to save your work. Call this rmip_2021 (RMIP is short for Research Methods In Practice, the title of this module.
Make sure this new folder is in your home directory (if you’re not sure, watch the video).
Inside this new folder, make a new RMarkdown file (use the ‘file’ menu and choose ‘new’). When you save the file make sure it has the extension .rmd, so call it data-handling.rmd for example.
Use this new .rmd file to save your work during this session.

NOTE: this video was made for another module called ‘datafluency’. You should make a folder called rmip not `datafluency

Cleaning questionnaire data

More than 80% of the time spent doing data analysis can be taken up getting your data into the right format. There are lots of tricks and techniques you can learn to become fluent in data handling.

In this module we’ll cover just a few—enough to work with our data, when you export it from Office 365.

See this article for a gentle introduction to some good habits https://www.theanalysisfactor.com/preparing-data-analysis/, and also Wickham (2014 a)

You might also like to remind yourself of the stage 1 notes on entering data by hand, and uploading it to RStudio, which are here

Loading Excel data

You should always keep your data in csv format. Don’t use Excel or other proprietary formats to store data unless absolutely necessary.

Unfortunately, some tools (including Office 365 Forms) provide data in xlsx format, and for this we need to use a special function in R. You can import this data to RStudio by:

Uploading the xlsx file to the RStudio server, as you have done before
Loading the rio package
Using the import() function in that package, instead of read_csv()

There is an example dataset linked below which we will use for this session. These data were exported from an Office 365 Form, so they are in a similar format to that which your questionnaire data will be:

Sweets example in xslx format

We can look at the first few rows of the data, using the head command:

sweets %>% head()

  ID          Start time     Completion time     Email Name How much do you like sweets? How much do you like chocolate
1  6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA            I don't like them              I don't like them
2  7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA                  I'm neutral              I don't like them
3  8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA                  I like them                    I'm neutral
4  9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA                  I'm neutral                    I'm neutral
  Gender
1      M
2      F
3      M
4      F

Check the ‘before you start’ instructions at the top of this page
Import the sweets data into RStudio, as shown above
Save it to a new variable called sweets

Tidying up

For the moment we are going to use the sweets data from above to illustrate how to tidy up imported data.

When we look at the imported data it’s useful to note:

There are extra columns we don’t need (at least for now).
Some of our variable names are very long and annoying to type (for example How much do you like sweets? is the name of one of our columns).
Our responses are in text format, rather than as numbers. For example, the data say "I don't like them" or "I'm neutral" rather than numbers from a 1-5 scale.
We may have other data (e.g. MCQ test scores for the outcome) that are stored in a separate file.

We need to sort each of these problems to make things more manageable for our analysis.

Selecting and using variables

In the sweets you imported you will notice that you have:

variables you don’t need (we want to drop these)
column names which have long/complex names, which include spaces

R makes using columns with spaces or other special characters very hard.

To use columns with spaces in we must ‘escape’ the spaces and let R know they are part of the name rather than a gap between two different names.

This video shows how (or read below):

To escape spaces and use columns with long names we use the backtick character (the backwards facing apostrophe) to wrap the column name.

In general, if your columns contain spaces or other odd characters like hyphens or question marks then you will need to wrap them in backticks.

On windows

On a Mac

Autocomplete

To avoid having to type backticks use RStudio’s autocomplete functionality.

In this video you can see me using autocomplete which wraps the long variable name in backticks automatically.

Try using autocomplete to select one of the variables in the sweets dataset.
Remember to use the shortcut cmd+shift+m (mac) or ctrl+shift+m (pc) to enter the pipe symbol quickly using the keyboard.

Selecting columns

If you have a very large dataset it sometimes helps to simplify things. Also, you might need to remove columns to anonymise your data to share with others.

You can create a new dataframe, selecting only columns you need, by using the select() function. For example:

sweets_subset <- sweets %>%
  select(ID, Gender, `How much do you like sweets?`)

Explanation: We have created a new variable called sweets_subset which contains 3 columns instead of the original 8. We can see this by using glimpse:

sweets %>% glimpse

Rows: 4
Columns: 8
$ ID                               <dbl> 6, 7, 8, 9
$ `Start time`                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 0…
$ `Completion time`                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 0…
$ Email                            <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name                             <lgl> NA, NA, NA, NA
$ `How much do you like sweets?`   <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"
$ Gender                           <chr> "M", "F", "M", "F"

Compared with:

sweets_subset %>% glimpse

Rows: 4
Columns: 3
$ ID                             <dbl> 6, 7, 8, 9
$ Gender                         <chr> "M", "F", "M", "F"
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"

Create a new variable using (<- and select) called sweets_subset. This should contain only 2 columns from the original imported data.

Renaming columns

Some of the imported variable names in the sweets data are long and awkward to use.

Most researchers would rename these variables, to make them more usable in R code.

You can rename variables like this:

datasetname %>% 
  rename(NEW_COLUMN_NAME = OLD_COLUMN_NAME)

So for this example:

sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`,
  )

  ID          Start time     Completion time     Email Name       like_sweets    like_chocolate Gender
1  6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA I don't like them I don't like them      M
2  7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA       I'm neutral I don't like them      F
3  8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA       I like them       I'm neutral      M
4  9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous   NA       I'm neutral       I'm neutral      F

Explanation of the code: We used rename to change the names of our variables. We needed to wrap the long names of the questions in ‘backtick’ symbols to make sure R understood it was a single column name.

You should create a new variable to save the renamed dataset (with a descriptive name for use later on):

sweets.renamed <- sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`,
  )

Create a copy of the sweets data in which you have renamed the two long column names.

Recoding text data

We noticed above that our responses were stored as text labels like "I don't like them" rather than on a numeric scale. This makes it hard to use in an analysis.

We need to recode the text variables into numeric versions.

How to do it

First we must tell R what number we want to use for each text label. That is, we create a mapping of numbers to labels.

This takes a few steps:

Check exactly what the text values are which need to be mapped.
Make a mapping variable which assigns each text value a number value
Use the recode function with mutate to create a new, numeric column

This video walks you through the steps below:

Step 1: Check EXACTLY what text labels we have

To check which labels we need to recode, I select the column in question and use the unique() function.

sweets %>% 
  select(`How much do you like sweets?`) %>% 
  unique()

  How much do you like sweets?
1            I don't like them
2                  I'm neutral
3                  I like them

Do the same to find out the possible values of the How much do you like chocolate column.

sweets %>% 
  select(`How much do you like chocolate`) %>% 
  distinct()

  How much do you like chocolate
1              I don't like them
2                    I'm neutral

Step 2: Make a mapping variable

We do this by creating what R calles a named vector, which is a special kind of list.

To make a named vector we use the the c() function. The letter c here just stands for ‘combine’ — i.e. ‘combine these things into a list’.

This is a simple example:

mapping.list <- c("No" = 0, "Yes" = 1)

We could then use this mapping to recode a column of data which contained the words “No” or "Yes

A useful trick when creating your own mappings is to use R to do the formatting for you (see the video above for a demo).

Re-using the code from the previous step, we use unique() to show us the unique values for the quuestion about sweets.

We then pipe the result to the paste() and cat() functions, like this:

sweets %>% 
  select(`How much do you like sweets?`) %>% 
  unique() %>% 
  paste() %>% cat()

c("I don't like them", "I'm neutral", "I like them")

Explanation of the output: Using paste and cat is a bit of a hack. When we run this code we see the output c("I don't like them", "I'm neutral", "I like them"). This is a list of the values in the sweets data for this question, formatted in a way that will be useful to us in the next step.

We then copy and paste this output into a NEW code block, and edit it to assign our mappings:

likert.mappings <- c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1)

Explanation of the code: We used the previous output to create a mapping. By adding the = -1 and = 0 etc, we have told R what value we want to assign for each label.

Q: How do you know what number values to assign?

A: It doesn’t matter, provided:

The intervals between each options are the same and
Each text value has a different number

So, if we had a Likert-scale ranging from “Completely agree” to “Completely disagree” in 7 increments, we could score this from 0 - 6 or 1 - 7, or -3 - 3. These would all be fine.

Step 3: Use the mapping variable to recode the column

We can use our new mapping with the mutate and recode functions to make a new column, containing numbers rather than text:

sweets.recoded <-  sweets %>%
    mutate(
        like_sweets_numeric =
            recode(`How much do you like sweets?`, !!!likert.mappings)
    )

We can see this new column if we use glimpse:

sweets.recoded %>% glimpse()

Rows: 4
Columns: 9
$ ID                               <dbl> 6, 7, 8, 9
$ `Start time`                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-24 0…
$ `Completion time`                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05-24 0…
$ Email                            <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name                             <lgl> NA, NA, NA, NA
$ `How much do you like sweets?`   <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"
$ Gender                           <chr> "M", "F", "M", "F"
$ like_sweets_numeric              <dbl> -1, 0, 1, 0

Explanation of the code: The start of the first line is sweets.recoded <- which means make a new variable called sweets.recoded. Then we use mutate to create a new column called like_sweets_numeric. We make this column using recode on the question about liking sweets. We use the likert.mappings mapping to specify what numeric score to give each of the text values.

Watch out for the exclamations marks!!!: In the code there are three exclamation marks, !!!, before the mapping; make sure you do the same.

Summary/video explanation

This is probably the trickiest part of all the R code we use. I’ve included an annotated video of just these steps as a reference below:

Use this three-step process to create a recoded version of the like_chocolate variable.

Remember to watch the video at the start of this section, or the short version in the green box above, if anything is unclear.

If you have any questions or get stuck, please ask questions in the workshop or on Discourse

Reverse codings

Watch out for reverse ‘coded items’. Imagine your questionnaire includes two items:

I really love sweets
I really hate sweets

If this were the case then using the same mapping and adding responses to these questions would not be appropriate.

Instead, we should use two separate mappings for the different questions:

agree.disagree.responses <- c(
            "Agree" = 3,
            "Neutral" = 2,
            "Disagree" = 1)

agree.disagree.responses.REVERSED <- c(
            "Agree" = 1,
            "Neutral" = 2,
            "Disagree" = 3)

Coded in this way, scores for the questions could be added or averaged to provide a summary score.

Multiple new columns

We can create multiple new columns at once using mutate, to recode each of our variables:

sweets.recoded  <- sweets.renamed %>%
    mutate(
        like_sweets_numeric = recode(like_sweets, !!!likert.mappings),
        like_chocolate_numeric = recode(like_chocolate, !!!likert.mappings)
    )

Explanation of the code: We combined the two examples above to create 3 new columns with recoded (numeric) values. We saved this to the variable called sweets.recoded.

Combining scores

We designed our questionnaires to make repeated measurements of the same phenomena.

That means we don’t want to analyse the questions individually: we’d like to sum or average the scores per-person.

There are various ways of describing this: summary scores, aggregate scores, scale score, total score all likely refer to the same thing: a single new variable that we have created by adding up or averaging scores on a number of variables.

Q: Does it matter if you average* or take the sum of the questionnaire items?*

A: No! It might seem surprising, but it doesn’t matter. If we use a summary score in a regression then the coefficients will be a bit different, because the scale of our predictor variable has changed. But the strength of the relationship will be the same, and the tests statistics we use will be identical.

We already saw how to use mutate to create a new column in the section on recoding. We can use this again to create the sum of both questions:

sweets.recoded %>%
  mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
  select(starts_with('like'))

        like_sweets    like_chocolate like_sweets_numeric like_chocolate_numeric
1 I don't like them I don't like them                  -1                     -1
2       I'm neutral I don't like them                   0                     -1
3       I like them       I'm neutral                   1                      0
4       I'm neutral       I'm neutral                   0                      0

Explanation of the code: We added both the questions which asked about ‘liking’ together. This created a new column containing the combined score, called liking. In the third line we selected only columns with names that started with the word ‘like’, to make the output easier to inspect

We should probably save the result using a new variable name though, so we can use it later:

sweets.recoded.withsummary <- sweets.recoded %>%
  mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
  select(ID, matches('like'), liking)

Consolidation

Consolidate all the steps shown above to do the following in a single code block:

Read in the sweets xlsx file
Rename the long column names to something short, and without spaces
Recode the ‘like sweets’ and ‘like chocolate’ columns to be numeric
Create a summary score called sweet_tooth which is the sum of the sweets and chocolate questions
Save the result of this work in a new variable called sweets.tidy

sweets <- import('sweets.xlsx')

# used to check what response values are in each question
sweets %>% 
  select(`How much do you like sweets?`) %>% 
  unique() %>% paste %>% cat

c("I don't like them", "I'm neutral", "I like them")

sweets %>% 
  select(`How much do you like chocolate`) %>% 
  unique() %>% paste %>% cat

c("I don't like them", "I'm neutral")

# we will use the same mapping for both questions because they have the same responses
sweets.map <- c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1)


sweets.tidy <- sweets %>% 
  # now we recode the two text variables (we only need use mutate once though)
  mutate(
    like_sweets = recode(`How much do you like sweets?`, !!!sweets.map), 
    like_choc = recode(`How much do you like chocolate`, !!!sweets.map)
  ) %>% 
  # and create the summary score
  mutate(sweet_tooth = like_sweets + like_choc ) %>% 
  select(ID, like_sweets, like_choc, sweet_tooth)
  
# finally, check the result. it looks ok
sweets.tidy %>% glimpse

Rows: 4
Columns: 4
$ ID          <dbl> 6, 7, 8, 9
$ like_sweets <dbl> -1, 0, 1, 0
$ like_choc   <dbl> -1, -1, 0, 0
$ sweet_tooth <dbl> -2, -1, 1, 0

Homework

Make sure you are up to date with all of the lifesavR worksheets, and this data handling worksheet.
Keep a copy of all your code in an Rmd file, and spend 5 minutes tidying this up and adding comments to explain what the different commands do.
If you are an MSc student (or a BSc student and feel like you could do with some revision) work through the section of this worksheet which explains standard deviations and the later section on ‘effect sizes’. You don’t need to complete the whole worksheet (most of the content should be familiar) but the section on effect sizes will be important in our final session on reporting.

TARA Led session

MCQ and Survey completion

If you have not already done so, complete the MCQ now at this link:

https://forms.office.com/r/JB9hCysPEU

Everyone also now needs to complete the 2 selected surveys linked from this discourse post

A real example

The sweets data in the previous session were a very simple example.

We can practice these skills using a dataset generated by students last year. This group’s questionnaire measured sleep quality, and the data are available here:

This data file has been slightly cleaned up, but is similar to the format that your O365 output will take. The only changes I made to these data were to:

Delete identifying information (names and emails)
Remove one or two extra columns

Tidy up and recode the sleep dataset:

Import the sleep.xlsx document (download xfun::embed_file(c(‘data/student-examples/sleep.xlsx’), text=“here”))
Use rename to change the variable names to be short and easy to use.
Use mutate with recode to change at least 3 of the questionnaire items to numeric scores (watch out for reverse coding).

When you have done this,

Use mutate to combine the recoded questions into a single summary score. This can be either to sum or the average of the 6 questions you recoded.
Create a plot of this summary score, showing the distribution of values (e.g. a histogram).
Save a cleaned-up copy of the data using write_csv.

Extension exercises

‘Matching’ columns by names

This is a slightly more advanced technique. It can sometimes save time, but it’s not required to pass the course so skip this if you are short on time or finding R tricky.

Another good way to avoid backticks and long column names is to use the matches() function. Especially so if you have several questions which start in the same way and have a consistent root.

The example below selects all columns where the name includes the text "How much do you", as well as the ID and Gender columns:

sweets_how_much <- sweets %>%
  select(ID, Gender, matches("How much"))

sweets_how_much  %>% glimpse

Rows: 4
Columns: 4
$ ID                               <dbl> 6, 7, 8, 9
$ Gender                           <chr> "M", "F", "M", "F"
$ `How much do you like sweets?`   <chr> "I don't like them", "I'm neutral", "I like them", "I'm neutral"
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like them", "I'm neutral", "I'm neutral"

Explanation of the code: We used the select command to choose only the columns we needed. We used the matches() function to avoid typing each of the long question column-names by hand. When we say matches("How much"), this selects all the columns which start with the text "How much". We save this subset of columns with a new variable name, sweets_how_much

Other useful functions for this purpose are:

starts_with("str")
ends_with("str")
contains("str")
num_range("Q", 1:10) (would match variables called "Q1" or "Q01", "Q2", through to "Q10")

See https://tidyselect.r-lib.org/reference/select_helpers.html for more information.

Try selecting the two columns called Start time and Completion time using the matches function.

sweets %>%
  select(matches("time"))

           Start time     Completion time
1 2019-05-24 09:12:31 2019-05-24 09:12:35
2 2019-05-24 09:12:31 2019-05-24 09:12:35
3 2019-05-24 09:12:31 2019-05-24 09:12:35
4 2019-05-24 09:12:31 2019-05-24 09:12:35

Assessment progress

This session introduces skills required for questions 3, 4 and 5 of the data assignment.

References

Wickham, Hadley. 2014a. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.

———. 2014b. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.

Data Handling

Ben Whalley

November 2021