Data wrangling

Ben Whalley

October 2023

Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014).

It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.

Overview

In the lifesavR worksheets we used various commands from the tidyverse, like filter and group_by.

If you want to recap these commands you could use the cheatsheet, especially the part on groups and summaries.
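For example, a typical recap pipeline chains filter, group_by and summarise. Here is a quick sketch using the built-in mtcars dataset (just for illustration; any dataset would do):

library(tidyverse)

# keep cars with mpg above 20, then calculate the mean mpg for each number of cylinders
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))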


Today we will cover three additional techniques which are important when working with real datasets:

  1. ‘Pivoting’ or reshaping data from long to wide formats (or the reverse)
  2. Separating ‘untidy’ columns into multiple tidy variables
  3. Adding meaningful labels to categorical variables

Before you start

Make a new rmd file in your datafluency directory, called data-wrangling.rmd and record your work in this for the next 2 sessions.

Selecting columns

The fuel dataset contains variables for fuel economy (mpg), weight, and power.

We can select just these columns and save them to a smaller dataframe like this:

carperformance <- fuel %>% 
  select(mpg, weight, power)

Explanation of the commands

  • On the far left we have the name of the new variable which we will create: carperformance.
  • We can tell this will be a new variable because the <- symbol is just to the right, pointing at it.
  • To see what carperformance contains, look to the right of the <-. We pipe the fuel data to the select command, which selects the mpg, weight, and power columns.

Explanation of the result

When you run this command you won’t see any output, but a new object called carperformance is created, containing copies of the columns we selected from fuel.

We can see the first few rows of our new smaller dataframe like this:

carperformance %>% head() 
   mpg weight power
1 21.0   1188   110
2 21.0   1304   110
3 22.8   1052    93
4 21.4   1458   110
5 18.7   1560   175
6 18.1   1569   105

Try selecting columns in a dataset for yourself:

  • Use any of the built in datasets, creating a copy with just a subset of 3 of its columns.

Pivoting longer

Data is commonly stored in either wide or long format.

If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely stored and analysed the data in wide format.

In wide format, each row represents the observations from a single participant. The measurements for a given participant are stored in separate columns.

This is often called row per subject data. An example is the built in attitude dataset:

attitude %>%
  head()
  rating complaints privileges learning raises critical advance
1     43         51         30       39     61       92      45
2     63         64         51       54     63       73      47
3     71         70         68       69     76       86      48
4     61         63         45       47     54       84      35
5     81         78         56       66     71       83      47
6     43         55         49       44     54       49      34

Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type ?attitude into the console.


Let’s say we want a single plot of all these variables, something like this:

To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns.

It’s often called “row per observation” data.

Using pivot_longer()

Pivoting is where you reshape a dataset: either taking a wide data file (lots of columns, few rows) and making it longer, or taking a long data file (lots of rows, few columns) and making it wider.

We can convert from wide to long using the pivot_longer() function, as shown in the video:

To see why the function is called ‘pivot_longer’, imagine trying to reshape just the first two rows of the attitude dataset:

  rating complaints privileges learning raises critical advance
1     43         51         30       39     61       92      45
2     63         64         51       54     63       73      47

If we use pivot_longer on this selection, we end up with this:

attitude %>%
  head(2) %>% 
  pivot_longer(everything()) 
# A tibble: 14 × 2
   name       value
   <chr>      <dbl>
 1 rating        43
 2 complaints    51
 3 privileges    30
 4 learning      39
 5 raises        61
 6 critical      92
 7 advance       45
 8 rating        63
 9 complaints    64
10 privileges    51
11 learning      54
12 raises        63
13 critical      73
14 advance       47

Explanation of the command:

  • We selected a subset of columns and rows.
  • Then we used pivot_longer(everything()) to make this into long form data.
  • The everything() part tells R to merge values from all of the columns into a single new column called value, and to keep track of the original variable names in a new column called name.

The change works like this:

Converting from wide format to long format

You might have spotted a problem though: We don’t have a record of which participant was which in the attitude dataset.

This is because the mapping to participants was implicit: each row was a different participant, but participant number was not actually recorded in the file.

We can create an explicit participant identifier by adding a new column. For this we use the mutate and row_number() functions:

attitude_with_person <- attitude %>%
  mutate(person = row_number()) %>%
  head(2) 

attitude_with_person %>% 
  pander()
rating complaints privileges learning raises critical advance person
43 51 30 39 61 92 45 1
63 64 51 54 63 73 47 2

Now we have a column called person which stores the row number.

But this means if we pivot_longer() again, we will need to tell R which columns we would like to pivot.

If we don’t do this then the person column gets melted with everything else so we lose track of which response belonged to which participant, like this:

attitude_with_person %>%
  pivot_longer(everything()) %>% 
  pander()
name value
rating 43
complaints 51
privileges 30
learning 39
raises 61
critical 92
advance 45
person 1
rating 63
complaints 64
privileges 51
learning 54
raises 63
critical 73
advance 47
person 2

Explanation of the output: Because we didn’t tell pivot_longer which columns we wanted to pivot, it put all the values into a single new column called value. This included our participant identifier, person, which is not what we wanted.


We can exclude person from the pivoting by writing:

attitude_with_person %>%
  pivot_longer(-person) %>% 
  head() %>% 
  pander()
person name value
1 rating 43
1 complaints 51
1 privileges 30
1 learning 39
1 raises 61
1 critical 92

Explanation of the command and output:

  • Here, we still use pivot_longer but this time we put -person between the parentheses.
  • The minus sign, -, means don’t include this variable, so -person ends up meaning include all columns except person, which is what we wanted.
  • The output now retains the person column, but pivots the other variables.
  • This means we can tell which person provided each datapoint.

Use some tidyverse commands you already know (e.g. select), plus pivot_longer, to produce this plot using the attitude dataset:

  • Check the cheatsheet if you get stuck
  • You need to select only the three variables shown
  • It’s not necessary to create a person identifier for this plot (although it won’t hurt if you do)
attitude %>%
  select(rating, complaints, learning) %>%
  pivot_longer(everything()) %>% 
  ggplot(aes(name, value)) +
  geom_boxplot()

Pivoting to make summary tables

Imagine we want a table of the mean score for each question in the attitude dataset.

This would be fiddly if we just tried to use summarise on wide format data. But if we use pivot_longer, group_by and then summarise (in that order) it’s possible to take the data and make a table like this with 3 instructions to R:

Name Mean SD
advance 42.93 10.29
complaints 66.6 13.31
critical 74.77 9.895
learning 56.37 11.74
privileges 53.13 12.24
raises 64.63 10.4
rating 64.63 12.17

Combine the pivot_longer, group_by and summarise commands (in that order) to reproduce the table above.

  • You want to pivot all of the variables in the attitude dataset this time
  • We covered using summarise in the third lifesavR session here.
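If you get stuck, one possible answer is sketched below (your column headings may differ slightly):

# pivot all columns, then summarise the mean and SD for each question
attitude %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarise(Mean = mean(value), SD = sd(value)) %>%
  pander()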

We might also want to produce summary statistics per-participant.

Using the commands shown above (and remembering to make a new column to store the participant number with row_number()), reproduce this table:

person mean(value)
1 51.57
2 59.29
3 69.71
4 55.57
5 68.86

To make the table you will need to use the following functions, in roughly this order:

  • mutate
  • filter
  • pivot_longer
  • group_by
  • summarise
  • pander
attitude %>% 
  mutate(person = row_number()) %>% 
  filter(person < 6) %>% 
  pivot_longer(-person) %>% 
  group_by(person) %>% 
  summarise(mean(value)) %>% 
  pander()

Pivoting wider

Sometimes we have the opposite problem: We have long data, but want it in wide format. For example, we might want a table where it’s easy to compare between different years, like this:

development %>%
  filter(year > 1990) %>% 
  pivot_wider(id_cols=country, 
              names_from=year, 
              values_from=gdp_per_capita) %>% 
  head(3) %>% 
  pander::pander("GDP per-capita in 3 countries in 4 different years, from the development dataset.")
GDP per-capita in 3 countries in 4 different years, from the development dataset.
country 1992 1997 2002 2007
Afghanistan 649.3 635.3 726.7 974.6
Albania 2497 3193 4604 5937
Algeria 5023 4797 5288 6223

Instead of making the data longer, now we want to pivot_wider.

The development data is in a fairly long format: there are multiple rows per country, corresponding to different years.

We want to compare GDP in different years.

We first need to select the data we want — country, year and GDP, for the years after 1990:

development1990s <- development %>%
  select(country, year, gdp_per_capita) %>%
  filter(year >= 1990)

Then we can pivot_wider():

development1990s %>%
  pivot_wider(
    names_from = year, 
    values_from = gdp_per_capita
  ) %>% 
  head() %>% 
  pander()
country 1992 1997 2002 2007
Afghanistan 649.3 635.3 726.7 974.6
Albania 2497 3193 4604 5937
Algeria 5023 4797 5288 6223
Angola 2628 2277 2773 4797
Argentina 9308 10967 8798 12779
Australia 23425 26998 30688 34435

Explanation of the command and output:

  • We started with multiple rows per country, corresponding to years.
  • We used pivot_wider with names_from = year to create new columns for each year in the data.
  • We used values_from=gdp_per_capita to tell pivot_wider to use the GDP numbers to populate the table.
  • The resulting table helps us compare years within countries, or between countries, for a given year.

Use the funimagery dataset in psydata and perform the following:

  • Use select to make a dataset with intervention and each of the kg1 to kg3 columns
  • Use pivot_longer, group_by and summarise to calculate the average weight of participants at each timepoint
  • Adapt the group_by function to calculate the mean at each timepoint for each group separately
  • Add pivot_wider to the end of your code to create a separate column for each group.

When you finish your data should look like this:

name MI FIT
kg1 89.86 91.46
kg2 88.62 86.37
kg3 88.46 84.04
funimagery %>% 
  select(intervention, kg1, kg2, kg3) %>% 
  pivot_longer(-intervention) %>% 
  group_by(name, intervention) %>% 
  summarise(M=mean(value)) %>% 
  pivot_wider(names_from=intervention, values_from=M) %>% 
  pander()

Separating variables

Sometimes we need to separate ‘untidy’ variables into tidy, long-form data.

The code below generates simulated data for 100 individuals at three time points. The format is similar to the way you might record experimental data in a spreadsheet.

set.seed(1234)
N <- 100
repeatmeasuresdata <- tibble(person = 1:N,
                              time_1 = rnorm(N),
                              time_2 = rnorm(N, 1),
                              time_3 = rnorm(N, 3))

repeatmeasuresdata %>% head(8) %>% 
  pander()
person time_1 time_2 time_3
1 -1.207 1.415 3.485
2 0.2774 0.5253 3.697
3 1.084 1.066 3.186
4 -2.346 0.4975 3.701
5 0.4291 0.174 3.312
6 0.5061 1.167 3.76
7 -0.5747 0.1037 4.842
8 -0.5466 1.168 4.112

This variable, repeatmeasuresdata, is in wide format. Each row contains data for one participant, and each participant has three observations.

As we saw previously, we can pivot — i.e., reshape — the data into longer format like so:

repeatmeasuresdata %>%
  pivot_longer(starts_with("time")) %>%
  arrange(person, name) %>%
  head(8) %>% 
  pander()
person name value
1 time_1 -1.207
1 time_2 1.415
1 time_3 3.485
2 time_1 0.2774
2 time_2 0.5253
2 time_3 3.697
3 time_1 1.084
3 time_2 1.066

The problem we have now is that name contains text which describes at which time the observation was made. We probably want to store a number for each time-point, so we can make a plot with time on the x axis.

The separate command splits a single character column (name) into multiple columns. Rather than one column with labels of the form ‘time_1’, it can create two columns, containing ‘time’ and ‘1’ respectively.

# convert to long form; split `name` to extract `time` into its own column
longrepeatmeasuresdata <- repeatmeasuresdata %>%
  pivot_longer(starts_with("time")) %>%
  separate(name, into = c("variable", "time"))

longrepeatmeasuresdata %>% head %>% 
  pander()
person variable time value
1 time 1 -1.207
1 time 2 1.415
1 time 3 3.485
2 time 1 0.2774
2 time 2 0.5253
2 time 3 3.697

Now the data are in long format, we can plot the points over time:

longrepeatmeasuresdata %>%
  sample_n(30) %>%
  ggplot(aes(x=time, y=value)) +
  geom_point()

How does R know where to split the text?

In the example above, separate split data like "time_1", "time_2" etc into two columns: variable and time.

Q: How did it know to use the underscore (_) to split the data?

A: The default is to split on anything which is not a letter or number. So _ or a space, or , would all work.
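For example (a small sketch with a made-up column of labels):

# separate() splits on any non-alphanumeric character by default,
# so underscores, spaces and commas all work without extra options
tibble(label = c("time_1", "time 2", "time,3")) %>%
  separate(label, into = c("variable", "time"))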

Sometimes though we need to tell R explicitly what to use to separate the values.

If we had a column of email addresses we could split ben.whalley@plymouth.ac.uk into the username (e.g. ben.whalley) and domain name (plymouth.ac.uk) using the @ symbol.

To do this we just write sep="@" when we use separate.
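For example (a minimal sketch using a single made-up email address):

# split an email address at the @ symbol into username and domain
tibble(email = "ben.whalley@plymouth.ac.uk") %>%
  separate(email, into = c("username", "domain"), sep = "@")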

The messy_exp dataset in psydata contains simulated RT data on 100 participants in 2 conditions (A and B) at three time points (1, 2, and 3).

  • Use the separate() function to split up the condition variable in this dataset and draw the following plot:

messy_exp %>% 
  separate(condition, into=c("participant", "condition", "time")) %>% 
  ggplot(aes(time, rt, color=condition)) + 
  geom_boxplot(width=.5) + 
  labs(x="Time", y="Reaction time (ms)", color="Condition")
  1. This file contains sample contact and address data for 100 people: https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv
  • Read the data into R (you can either use the URL above directly inside the read_csv() function, or download then re-upload the data to the server to do this)

  • Use the separate function to make a new variable which contains the domain name of these contacts’ email address (e.g. yahoo.com, hotmail.com)

Note, you will need to use sep="@" to split the email addresses at the @ symbol

  2. Use the distinct and/or count functions on the new variable you create containing the domain name. Look them up in the help file if you don’t know which to use to answer these questions:
  • How many people had a Gmail account?
  • Which domains had more than 10 users?
# read the data directly from the URL
contacts <- read_csv('https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv') %>% 
  separate(email, into=c("user", "domain"), sep ="@")  # uses the @ symbol as a separator
# how many _different_ domains are there?
contacts %>% 
  distinct(domain) %>% 
  count() %>% 
  pander()
n
39
# how many people use gmail
contacts %>% 
  count(domain) %>% 
  filter(domain=="gmail.com") %>% 
  pander()
domain n
gmail.com 16
# which domains had more than 10 users?
contacts %>% 
  count(domain) %>% 
  filter(n > 10) %>% 
  pander()
domain n
aol.com 13
gmail.com 16
hotmail.com 13
yahoo.com 13

Questionnaire data

The file sweets.csv contains a small number of example rows of data exported from an online survey.

The file is at: https://is.gd/axenun

We can get an overview of the data using the glimpse command:

sweets <- read_csv('https://is.gd/axenun')
sweets %>% glimpse()
Rows: 4
Columns: 8
$ ID                               <dbl> 6, 7, 8, 9
$ `Start time`                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12…
$ `Completion time`                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:1…
$ Email                            <chr> "anonymous", "anonymous", "anonymous…
$ Name                             <lgl> NA, NA, NA, NA
$ `How much do you like sweets?`   <chr> "I don't like them", "I'm neutral", "…
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like th…
$ Gender                           <chr> "M", "F", "M", "F"
  • Import the sweets data as shown above from: https://is.gd/axenun

  • Save it to a new variable called sweets

Tidying questionnaires

When we look at the imported data it’s useful to note:

  1. There are extra columns we don’t need (at least for now).

  2. Some of our variable names are very long and annoying to type (for example How much do you like sweets? is the name of one of our columns).

  3. Our responses are in text format, rather than as numbers. For example, the data say "I don't like them" or "I'm neutral" rather than numbers from a 1-5 scale.

We need to sort each of these problems to make things more manageable for our analysis.

Selecting and renaming

Remember, R makes it hard to work with column names containing spaces or other special characters. We want to avoid this.

Selecting

To use columns with spaces in their names, we must ‘escape’ the spaces and let R know they are part of the name rather than a gap between two different names.

This video shows how (or read below):

To escape spaces and use columns with long names we use the backtick character (the backwards facing apostrophe) to wrap the column name.

In general, if your columns contain spaces or other odd characters like hyphens or question marks then you will need to wrap them in backticks.
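For example, to select the long-named columns from the sweets data we imported above:

# backticks let R treat the whole question, spaces and all, as one column name
sweets %>%
  select(`How much do you like sweets?`, `How much do you like chocolate`)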

Renaming

Some of the imported variable names in the sweets data are long and awkward to use.

Most researchers would rename these variables, to make them more usable in R code.

You can rename variables like this:

datasetname %>% 
  rename(NEW_COLUMN_NAME = OLD_COLUMN_NAME)

So for this example:

sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`,
  )
# A tibble: 4 × 8
     ID `Start time`        `Completion time`   Email     Name  like_sweets     
  <dbl> <dttm>              <dttm>              <chr>     <lgl> <chr>           
1     6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I don't like th…
2     7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I'm neutral     
3     8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I like them     
4     9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I'm neutral     
# ℹ 2 more variables: like_chocolate <chr>, Gender <chr>

Explanation of the code: We used rename to change the names of our variables. We needed to wrap the long names of the questions in ‘backtick’ symbols to make sure R understood it was a single column name.

You should create a new variable to save the renamed dataset (with a descriptive name for use later on):

# create a new variable containing the renamed dataset
sweets.renamed <- sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`
  )
  1. Create a copy of the sweets data in which you have selected only the two columns with long names.

  2. Create a second copy of the data where you have renamed the columns with long names to something short, and without spaces.
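One possible answer is sketched below (the new variable names are just suggestions):

# 1. a copy with only the two long-named columns
sweets.questions <- sweets %>%
  select(`How much do you like sweets?`, `How much do you like chocolate`)

# 2. a copy where those columns are renamed to short names without spaces
sweets.short <- sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`
  )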

Renaming with the janitor package

A good alternative to renaming variables manually is to use the clean_names function in the janitor package.

sweets %>% 
  janitor::clean_names() %>% 
  glimpse
Rows: 4
Columns: 8
$ id                             <dbl> 6, 7, 8, 9
$ start_time                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:3…
$ completion_time                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:…
$ email                          <chr> "anonymous", "anonymous", "anonymous",…
$ name                           <lgl> NA, NA, NA, NA
$ how_much_do_you_like_sweets    <chr> "I don't like them", "I'm neutral", "I …
$ how_much_do_you_like_chocolate <chr> "I don't like them", "I don't like them…
$ gender                         <chr> "M", "F", "M", "F"

Explanation of the code and result: I used the clean_names function from the janitor package without loading it with library. I did this by typing janitor:: followed by the name of the function. In the result, clean_names has made a new dataset in which it has:

  • Removed all special characters
  • Made everything lower case (easier for R to autocomplete)
  • Replaced spaces with underscores
  • Made column names unique (this isn’t always the case with imported data, but is important for R)

I typically use this function when importing any new data because it makes the naming and access of columns much more consistent and easier to remember.

Recoding text

We noticed above that our responses were stored as text labels like "I don't like them" rather than on a numeric scale. This makes it hard to use in an analysis.

We need to recode the text variables into numeric versions.

How to do it

First we must tell R what number we want to use for each text label. That is, we create a mapping of numbers to labels.

This takes a few steps:

  1. Check exactly what the text values are which need to be mapped.
  2. Make a mapping variable which assigns each text value a number value
  3. Use the recode function with mutate to create a new, numeric column

This video walks you through the steps below:

Step 1: Check EXACTLY what text labels we have

To check which labels we need to recode, I select the column in question and use the unique() function.

# check exactly which text values are in the dataset
sweets %>% 
  select(`How much do you like sweets?`) %>% 
  unique() 
# A tibble: 3 × 1
  `How much do you like sweets?`
  <chr>                         
1 I don't like them             
2 I'm neutral                   
3 I like them                   

Do the same to find out the possible values of the How much do you like chocolate column.

sweets %>% 
  select(`How much do you like chocolate`) %>% 
  distinct() %>% 
  pander()
How much do you like chocolate
I don’t like them
I’m neutral

Step 2: Make a mapping variable

We do this by creating what R calls a named vector, which is a special kind of list.

To make a named vector we use the c() function. The letter c here just stands for ‘combine’ — i.e. ‘combine these things into a list’.

This is a simple example:

mapping.list <- c("No" = 0, "Yes" = 1)

We could then use this mapping to recode a column of data which contained the words “No” or “Yes”.
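For example, we could recode a small made-up column of Yes/No answers like this (the !!! syntax is explained further down the page):

# a minimal sketch: recode a hypothetical column of Yes/No answers using mapping.list
tibble(answer = c("Yes", "No", "Yes")) %>%
  mutate(answer_numeric = recode(answer, !!!mapping.list))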


A useful trick when creating your own mappings is to use R to do the formatting for you (see the video above for a demo).

Re-using the code from the previous step, we use unique() to show us the unique values for the question about sweets.

We then pipe the result to the paste() and cat() functions, like this:

# the hack we use as a short-cut to creating a mapping variable
sweets %>% 
  select(`How much do you like sweets?`) %>% 
  unique() %>% 
  paste() %>% cat()
c("I don't like them", "I'm neutral", "I like them")

Explanation of the output: Using paste and cat is a bit of a hack. When we run this code we see the output c("I don't like them", "I'm neutral", "I like them"). This is a list of the values in the sweets data for this question, formatted in a way that will be useful to us in the next step.

We then copy and paste this output into a NEW code block, and EDIT it to assign our mappings:

preference.mappings <- c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1)

Explanation of the code: We used the previous output to create a mapping. By adding the parts which read = -1 and = 0 etc, we have told R what value we want to assign for each label.

Q: How do you know what number values to assign?

A: It doesn’t matter, provided:

  • The intervals between each option are the same, and
  • Each text value has a different number

So, if we had a Likert-scale ranging from “Completely agree” to “Completely disagree” in 7 increments, we could score this from 0 to 6, from 1 to 7, or from -3 to +3. These would all be fine.
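For example, both of these mappings would be acceptable (and equivalent) scorings of the 3-point sweets question above:

# two equivalent ways to score the same 3-point scale
preference.0to2    <- c("I don't like them" = 0, "I'm neutral" = 1, "I like them" = 2)
preference.minus1  <- c("I don't like them" = -1, "I'm neutral" = 0, "I like them" = 1)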

Step 3: Use the mapping variable to recode the column

We can use our new mapping with the mutate and recode functions to make a new column, containing numbers rather than text:

sweets.recoded <-  sweets %>% 
    rename(
      like_sweets = `How much do you like sweets?`,
      like_chocolate = `How much do you like chocolate`,
    ) %>% 
    # use recode to convert text response using preference.mappings
    mutate(
        like_sweets_numeric =
            recode(like_sweets, !!!preference.mappings)
    )

We can see this new column if we use glimpse:

sweets.recoded %>% glimpse()
Rows: 4
Columns: 9
$ ID                  <dbl> 6, 7, 8, 9
$ `Start time`        <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-…
$ `Completion time`   <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05…
$ Email               <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name                <lgl> NA, NA, NA, NA
$ like_sweets         <chr> "I don't like them", "I'm neutral", "I like them"…
$ like_chocolate      <chr> "I don't like them", "I don't like them", "I'm neu…
$ Gender              <chr> "M", "F", "M", "F"
$ like_sweets_numeric <dbl> -1, 0, 1, 0

Explanation of the code:

  • The start of the first line is sweets.recoded <- which means make a new variable called sweets.recoded.
  • Then we use mutate to create a new column called like_sweets_numeric.
  • We make this column using recode on the question about liking sweets.
  • We use the preference.mappings mapping to specify what numeric score to give each of the text values.

Watch out for the exclamation marks!!!: In the code above there are three exclamation marks, !!!, before the mapping; make sure you do the same.

Three exclamation marks: !!!

Understanding this isn’t necessary to get on with the course. Only read this if you are interested!

In the code above when we used recode we used three exclamation marks just before our list.

We defined the mapping:

likert.responses <- c(
            "I hate them" = 1,
            "I don't like them" = 2,
            "I'm neutral" = 3,
            "I like them" = 4,
            "I can't live without them" = 5)

And then used it with recode, with the three exclamation marks.

liking_of_sweets_data %>%
    mutate(like_sweets_numeric = recode(like_sweets_text, !!!likert.responses)) %>% 
  pander()

The reason for this is that recode actually expects us to specify the mapping for it like this:

liking_of_sweets_data %>%
    mutate(like_sweets_numeric = recode(like_sweets_text,
        "I hate them" = 1,
        "I don't like them" = 2 ...))

But this means we have to repeat the mapping for each of the questions. Because all the questions use the same mapping this gets repetitive, and can lead to errors.

The three exclamation marks !!! unpacks the list for us. So writing !!!likert.responses saves us the bother of writing it out in full each time.

Summary/video explanation

This is one of the trickiest bits of R code we use. I’ve included an annotated video of just these steps as a reference below:

  • Use this three-step process to create a recoded version of the like_chocolate variable.

  • Remember to watch the video at the start of this section, or the short version in the green box above, if anything is unclear.
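One possible answer is sketched below. It reuses preference.mappings, because the chocolate question uses the same response labels, and creates a like_chocolate_numeric column (which is used in the next section):

# recode the chocolate question, re-using the same mapping
sweets.recoded <- sweets.recoded %>%
  mutate(
    like_chocolate_numeric = recode(like_chocolate, !!!preference.mappings)
  )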

Combining scores

Often questionnaires are designed to make repeated measurements of the same phenomena, which can then be summed or averaged to create a more reliable measure.

We’ve already seen how mutate() creates a new column. We can use this again to create the sum of responses to both questions:

sweets.recoded %>%
  # mutate to create column containing sum of both recoded questions
  mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
  # select the new variable we created
  select(ID, liking) %>% 
  pander()
ID liking
6 -2
7 -1
8 1
9 0

Explanation of the code:

  • We added together the two questions which asked about ‘liking’. This created a new column, called liking, containing the combined score.
  • In the third line we selected only the ID column, plus the new column we made.

Using ChatGPT to automate things

In the sections above I explained how to tidy up a simple dataset. However it turns out many of these tasks are easily solved using ChatGPT.

You now have a choice you didn’t have a few years ago:

  1. Learn how to do these steps by hand (and sometimes use ChatGPT to automate the work)
  2. Learn just enough to be able to get ChatGPT to write code for you (and hope it’s correct)

I would recommend option 1 (at least until LLMs make improvements in code generation) because it will likely be faster and more accurate, and will deepen your understanding.

But as a guide, here’s how I might solve this problem with ChatGPT as an assistant.

Step 1. Show ChatGPT the data

I use these commands to output some information about the data in a format that’s easy to copy into ChatGPT.

sweets <- read_csv('https://is.gd/axenun')

# lists the column names
sweets %>% names()

sweets %>% glimpse()

# list the unique values in response to this question
sweets %>% count(`How much do you like sweets?`) %>% 
  pander()

Step 2: Ask GPT to recode it for us

This link shows how I asked ChatGPT to recode the sweets data:

https://chat.openai.com/share/325b1f77-c68c-417a-8e75-76b05a6b9633



Note that I:

  • told it the names of the columns
  • showed it a sample of the data (pasted in the output of glimpse)
  • explicitly listed the values in the columns I wanted to recode (pasted in output of count)
  • directly asked for R code
  • gave instructions to use the tidyverse (this makes the output more consistent with what you have learned)

Step 3: Debug and validate the R code GPT provides

In this example, ChatGPT provided the following code:

library(tidyverse)

# Load the sweets dataset (assuming it's already loaded)
sweets <- read_csv('https://is.gd/axenun')

# Recode the text responses into numeric variables
sweets <- sweets %>%
  mutate(Like_sweets = recode(`How much do you like sweets?`,
    "I don't like them" = 0,
    "I like them" = 1,
    "I'm neutral" = 2
  )) %>%
  mutate(Like_chocolate = recode(`How much do you like chocolate`,
    "I don't like them" = 0,
    "I like them" = 1,
    "I'm neutral" = 2
  )) %>%
  select(-`How much do you like sweets?`, -`How much do you like chocolate`)

# Rename variables with spaces to use underscores and be shorter
sweets <- sweets %>%
  rename(
    ID = `ID`,
    Start_time = `Start time`,
    Gender = `Gender`
  )

# Remove timestamp and email columns
sweets <- sweets %>%
  select(-`Start_time`, -`Email`)

# Check the modified dataset
glimpse(sweets)

Amazingly, this actually works!

However, it’s worth saying that it often doesn’t work on the first try; it depends on writing a good prompt and giving the model plenty of information about the data and what you want to achieve.

Some tips:

  • Think first. Be clear about what you want to achieve.
  • Explain in detail what the data are like and show examples.
  • Give explicit instructions.
  • If code doesn’t work, ask for explanations or corrections.
  • Break tasks down into smaller steps (ask for less and build up your script).
  • Paste in the code you already have and ask for a completion with one extra step (e.g. something you don’t know how to do).
  • If you can afford it, use ChatGPT 4 or try new R-specific LLMs as they are (inevitably) released.

Where there are bugs in ChatGPT’s code, you can

  • fix these bugs by hand, or
  • run the code, report the errors to ChatGPT and ask it to fix them for you.

The second option has some risks (ChatGPT’s code might delete your data, although it’s unlikely), and is probably slower. In this instance it took me quite a while to get ChatGPT to fix the simple errors, and this required understanding the code to some degree anyway.

As I recommend in the lectures, knowing how to use R well enough to fix the small errors introduced is likely to be more productive in the long run. Think of LLMs as your assistant rather than your replacement.

Consolidation activity

Use this example dataset: https://is.gd/ayinof

  • Read in the data.
  • Rename the long column names to something short, and without spaces
  • Recode at least three of the columns with data about sleep quality to be numeric
  • Save the result of this work in a new variable called sleep.tidy
  • Pivot the recoded variables and make a boxplot of responses to these questions
  • Create a summary score called sleep_quality which is the sum of these recoded questions (use mutate)
  • Create a density plot of this summary score and interpret what you see (describe the pattern in plain English)
sleep <- read_csv('https://is.gd/ayinof')

# used to check what response values are in each question
sleep %>% 
  select(`My sleep is affected by my study commitments`) %>% 
  unique() %>% paste %>% cat
c("Agree", "Somewhat agree", "Somewhat disagree", "Disagree", "Neither agree nor disagree", "Strongly agree", "Strongly disagree")
sleep %>% 
  select(`My electronic device usage negatively affects my sleep`) %>% 
  unique() %>% paste %>% cat
c("Disagree", "Strongly agree", "Somewhat disagree", "Agree", "Somewhat agree", "Neither agree nor disagree", "Strongly disagree")
# we will use the same mapping for both questions because they have the same responses
sleep.map <- c("Agree"=2, 
               "Somewhat agree"=1, 
               "Somewhat disagree"=-1, 
               "Disagree"=-2, 
               "Neither agree nor disagree"=0, 
               "Strongly agree"=3, 
               "Strongly disagree"=-3)


sleep.tidy <- sleep %>% 
  # now we recode the two text variables (we only need to use mutate once, though)
  mutate(
    sleep_study = recode(`My sleep is affected by my study commitments`, !!!sleep.map), 
    sleep_electronic = recode(`My electronic device usage negatively affects my sleep`, !!!sleep.map)
  )
# now we can pivot longer to make a plot
sleep.tidy %>% 
  pivot_longer(c(sleep_study, sleep_electronic)) %>% 
  ggplot(aes(name, value)) + geom_boxplot()

And make a summary score combining both questions

sleep.tidy.withsummary <- sleep.tidy %>% 
  # and create the summary score
  mutate(sleep_quality = sleep_study + sleep_electronic ) 
  
# check the result. it looks ok
sleep.tidy.withsummary %>% glimpse
Rows: 241
Columns: 13
$ uniqueid                                                                                    <chr> …
$ `Start time`                                                                                <dttm> …
$ `Completion time`                                                                           <dttm> …
$ `My sleep is affected by my study commitments`                                              <chr> …
$ `I achieve good quality sleep`                                                              <chr> …
$ `My electronic device usage negatively affects my sleep`                                    <chr> …
$ `Tiredness interferes with my concentration`                                                <chr> …
$ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> …
$ `I often achieve eight hours of sleep`                                                      <chr> …
$ `I regularly stay up past 11pm`                                                             <chr> …
$ sleep_study                                                                                 <dbl> …
$ sleep_electronic                                                                            <dbl> …
$ sleep_quality                                                                               <dbl> …
# finally, make the requested density plot
sleep.tidy.withsummary %>% 
  ggplot(aes(sleep_quality)) + 
  geom_density()

Check your knowledge

  • What does it mean for data to be “tidy”? Identify the three key characteristics

  • What function creates a new column in a dataset?

  • Which function allows you to choose a subset of the columns in a dataset?

  • What does pivot_longer do?

  • Why is pivot_longer useful when we want to make a faceted plot?

  • Why is wide data sometimes more useful than long data?

  • If you have RT data with 10 groups, what tidyverse ‘verbs’ (functions) would you use to calculate the mean for each group?

  • How can you read data into a dataframe from over the internet?

  • Does recode convert from text to numeric, or from numeric to text values?

  • Why is it important to recode text variables into numeric values when working with survey or questionnaire data?

  • Why is it important to copy and paste exact values when making a mapping variable for recode?

  • When renaming, is it: rename(oldvarname=newvarname) or rename(newvarname=oldvarname)?

  • What does the clean_names function from the janitor package do?

  • Imagine a single cell of your dataset contains the string “conditionA_time_1”. What function should we apply to it?

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.