Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014).
It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.
Overview
In the lifesavR worksheets we used various commands from the tidyverse, such as `filter` and `group_by`. If you want to recap these commands you could use the cheatsheet, especially the part on groups and summaries.
Today we will cover three additional techniques which are important when working with real datasets:
- 'Pivoting', or reshaping data from long to wide format (and the reverse)
- Separating 'untidy' variables into tidy, long-form data
- Recoding text responses (e.g. from questionnaires) into numeric values
Before you start
Make a new rmd file in your datafluency directory, called `data-wrangling.rmd`, and record your work in it for the next two sessions.
Selecting columns
The `fuel` data also contains variables for weight and power. We can select these columns (along with `mpg`) and save them to a smaller dataframe like this:
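A minimal sketch of that command (assuming the tidyverse is loaded and that `fuel` comes from the psydata package):

library(tidyverse)
library(psydata)   # assumed source of the fuel dataset

# keep only the mpg, weight and power columns in a new, smaller dataframe
carperformance <- fuel %>%
  select(mpg, weight, power)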
Explanation of the commands
- On the far left we have the name of the new variable which we will create: `carperformance`.
- We can tell this will be a new variable because the `<-` symbol is just to the right, pointing at it.
- To see what `carperformance` contains, look to the right of the `<-`. We pipe the `fuel` data to the `select` command, which selects the `mpg`, `weight`, and `power` columns.
Explanation of the result
When running the command you won't see any output, but a new object called `carperformance` was created, containing copies of the columns we selected from `fuel`.
We can see the first few rows of our new smaller dataframe like this:
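A sketch of that command (the code chunk isn't shown above; `head` prints the first six rows by default):

carperformance %>%
  head()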
mpg weight power
1 21.0 1188 110
2 21.0 1304 110
3 22.8 1052 93
4 21.4 1458 110
5 18.7 1560 175
6 18.1 1569 105
Try selecting columns in a dataset for yourself:
- Use any of the built in datasets, creating a copy with just a subset of 3 of its columns.
Pivoting longer
Data is commonly stored in either wide or long format.
If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely stored and analysed the data in wide format.
In wide format, each row represents the observations from a single participant. The measurements for a given participant are stored in separate columns. This is often called 'row per subject' data. An example is the built-in `attitude` dataset:
rating complaints privileges learning raises critical advance
1 43 51 30 39 61 92 45
2 63 64 51 54 63 73 47
3 71 70 68 69 76 86 48
4 61 63 45 47 54 84 35
5 81 78 56 66 71 83 47
6 43 55 49 44 54 49 34
Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type `?attitude` into the console.
Let’s say we want a single plot of all these variables, something like this:
To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns.
It’s often called “row per observation” data.
Using pivot_longer()
Pivoting is where you take a long data file (lots of rows, few columns) and make it wider. Or where you take a wide data file (lots of columns, few rows) and make it longer.
We can convert from wide to long using the `pivot_longer()` function, as shown in the video:
To see why the function is called ‘pivot_longer’, imagine trying to reshape just the first two rows of the attitude dataset:
rating complaints privileges learning raises critical advance
1 43 51 30 39 61 92 45
2 63 64 51 54 63 73 47
If we use `pivot_longer` on this selection, we end up with this:
# A tibble: 14 × 2
name value
<chr> <dbl>
1 rating 43
2 complaints 51
3 privileges 30
4 learning 39
5 raises 61
6 critical 92
7 advance 45
8 rating 63
9 complaints 64
10 privileges 51
11 learning 54
12 raises 63
13 critical 73
14 advance 47
Explanation of the command:
- We selected a subset of columns and rows.
- Then we used `pivot_longer(everything())` to make this into long form data.
- The `everything()` part tells R to merge values from all of the columns into a single new column called `value`, and to keep track of the original variable name in a new column called `name`.
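Putting that together, a minimal sketch of the command (using `head` to take the first two rows is an assumption about how the subset was made):

attitude %>%
  head(2) %>%                  # keep just the first two rows
  pivot_longer(everything())   # pivot every column into name/value pairs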
The change works like this:
You might have spotted a problem though: we don't have a record of which participant was which in the `attitude` dataset.
This is because the mapping to participants was implicit: each row was a different participant, but participant number was not actually recorded in the file.
We can create an explicit participant identifier by adding a new column. For this we use the `mutate` and `row_number()` functions:
# add a person column containing the row number, keeping just the first two rows
attitude_with_person <- attitude %>%
  mutate(person = row_number()) %>%
  head(2)
attitude_with_person %>%
pander()
rating | complaints | privileges | learning | raises | critical | advance | person |
---|---|---|---|---|---|---|---|
43 | 51 | 30 | 39 | 61 | 92 | 45 | 1 |
63 | 64 | 51 | 54 | 63 | 73 | 47 | 2 |
Now we have a column called `person` which stores the row number.
But this means that if we use `pivot_longer()` again, we will need to tell R which columns we would like to pivot. If we don't do this then the `person` column gets melted in with everything else, and we lose track of which response belonged to which participant, like this:
name | value |
---|---|
rating | 43 |
complaints | 51 |
privileges | 30 |
learning | 39 |
raises | 61 |
critical | 92 |
advance | 45 |
person | 1 |
rating | 63 |
complaints | 64 |
privileges | 51 |
learning | 54 |
raises | 63 |
critical | 73 |
advance | 47 |
person | 2 |
Explanation of the output: Because we didn't tell `pivot_longer` which columns we wanted to pivot, it put all the values into a single new column called `value`. This included our participant identifier, `person`, which is not what we wanted.
We can exclude `person` from the pivoting by writing:
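A minimal sketch of that call (`head` is an assumption, used here just to keep the printed output short):

attitude_with_person %>%
  pivot_longer(-person) %>%   # pivot every column except person
  head()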
person | name | value |
---|---|---|
1 | rating | 43 |
1 | complaints | 51 |
1 | privileges | 30 |
1 | learning | 39 |
1 | raises | 61 |
1 | critical | 92 |
Explanation of the command and output:
- Here, we still use `pivot_longer`, but this time we put `-person` between the parentheses.
- The minus sign, `-`, means 'don't include this variable', so `-person` ends up meaning 'include all columns except `person`', which is what we wanted.
- The output now retains the `person` column, but pivots the other variables.
- This means we can tell which person provided each datapoint.
Use some tidyverse commands you already know (e.g. `select`), plus `pivot_longer`, to produce this plot using the `attitude` dataset:
- Check the cheatsheet if you get stuck
- You need to select only the three variables shown
- It’s not necessary to create a person identifier for this plot (although it won’t hurt if you do)
Pivoting to make summary tables
Imagine we want a table of the mean score for each question in the attitude dataset.
This would be fiddly if we just tried to use `summarise` on wide format data. But if we use `pivot_longer`, `group_by` and then `summarise` (in that order) we can take the data and make a table like this with just three instructions to R:
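A sketch of those three steps (formatting the result with `pander` is an assumption, and the column labels in the published table may have been tidied afterwards):

attitude %>%
  pivot_longer(everything()) %>%                      # one row per question response
  group_by(name) %>%                                  # group by question
  summarise(Mean = mean(value), SD = sd(value)) %>%   # mean and SD for each question
  pander()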
Name | Mean | SD |
---|---|---|
advance | 42.93 | 10.29 |
complaints | 66.6 | 13.31 |
critical | 74.77 | 9.895 |
learning | 56.37 | 11.74 |
privileges | 53.13 | 12.24 |
raises | 64.63 | 10.4 |
rating | 64.63 | 12.17 |
Combine the `pivot_longer`, `group_by` and `summarise` commands (in that order) to reproduce the table above.
- You want to pivot all of the variables in the `attitude` dataset this time
- We covered using `summarise` in the third lifesavR session here.
We might also want to produce summary statistics per participant. Using the commands shown above (and remembering to make a new column to store the participant number with `row_number()`), reproduce this table:
person | mean(value) |
---|---|
1 | 51.57 |
2 | 59.29 |
3 | 69.71 |
4 | 55.57 |
5 | 68.86 |
To make the table you will need to use the following functions, in roughly this order:
- mutate
- filter
- pivot_longer
- group_by
- summarise
- pander
Pivoting wider
Sometimes we have the opposite problem: We have long data, but want it in wide format. For example, we might want a table where it’s easy to compare between different years, like this:
development %>%
filter(year > 1990) %>%
pivot_wider(id_cols=country,
names_from=year,
values_from=gdp_per_capita) %>%
head(3) %>%
pander::pander("GDP per-capita in 3 countries in 4 different years, from the development dataset.")
country | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|
Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
Albania | 2497 | 3193 | 4604 | 5937 |
Algeria | 5023 | 4797 | 5288 | 6223 |
Instead of making the data longer, now we want to `pivot_wider`.
The `development` data is in a fairly long format: there are multiple rows per country, corresponding to different years. We want to compare GDP in different years.
We first need to select the data we want (`country`, `year` and `gdp_per_capita`), for the years after 1990:
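That selection step isn't shown above; a minimal sketch (the name `development1990s` is taken from the chunk that follows):

# keep country, year and GDP per capita for years after 1990
development1990s <- development %>%
  filter(year > 1990) %>%
  select(country, year, gdp_per_capita)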
Then we can `pivot_wider()`:
development1990s %>%
pivot_wider(
names_from = year,
values_from = gdp_per_capita
) %>%
head() %>%
pander()
country | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|
Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
Albania | 2497 | 3193 | 4604 | 5937 |
Algeria | 5023 | 4797 | 5288 | 6223 |
Angola | 2628 | 2277 | 2773 | 4797 |
Argentina | 9308 | 10967 | 8798 | 12779 |
Australia | 23425 | 26998 | 30688 | 34435 |
Explanation of the command and output:
- We started with multiple rows per country, corresponding to years.
- We used `pivot_wider` with `names_from = year` to create new columns for each year in the data.
- We used `values_from = gdp_per_capita` to tell `pivot_wider` to use the GDP numbers to populate the table.
- The resulting table helps us compare years within countries, or countries within a given year.
Use the `funimagery` dataset in `psydata` and perform the following:
- Use `select` to make a dataset with `intervention` and each of the `kg1` to `kg3` columns
- Use `pivot_longer`, `group_by` and `summarise` to calculate the average weight of participants at each timepoint
- Adapt the `group_by` function to calculate the mean at each timepoint for each group separately
- Add `pivot_wider` to the end of your code to create a separate column for each group.
When you finish your data should look like this:
name | MI | FIT |
---|---|---|
kg1 | 89.86 | 91.46 |
kg2 | 88.62 | 86.37 |
kg3 | 88.46 | 84.04 |
Separating variables
Sometimes we need to separate ‘untidy’ variables into tidy, long-form data.
The code below generates simulated data for 100 individuals at three time points. The format is similar to the way you might record experimental data in a spreadsheet.
set.seed(1234)
N <- 100
repeatmeasuresdata <- tibble(person = 1:N,
time_1 = rnorm(N),
time_2 = rnorm(N, 1),
time_3 = rnorm(N, 3))
repeatmeasuresdata %>% head(8) %>%
pander()
person | time_1 | time_2 | time_3 |
---|---|---|---|
1 | -1.207 | 1.415 | 3.485 |
2 | 0.2774 | 0.5253 | 3.697 |
3 | 1.084 | 1.066 | 3.186 |
4 | -2.346 | 0.4975 | 3.701 |
5 | 0.4291 | 0.174 | 3.312 |
6 | 0.5061 | 1.167 | 3.76 |
7 | -0.5747 | 0.1037 | 4.842 |
8 | -0.5466 | 1.168 | 4.112 |
This variable, `repeatmeasuresdata`, is in wide format. Each row contains data for one participant, and each participant has three observations.
As we saw previously, we can pivot — i.e., reshape — the data into longer format like so:
repeatmeasuresdata %>%
pivot_longer(starts_with("time")) %>%
arrange(person, name) %>%
head(8) %>%
pander()
person | name | value |
---|---|---|
1 | time_1 | -1.207 |
1 | time_2 | 1.415 |
1 | time_3 | 3.485 |
2 | time_1 | 0.2774 |
2 | time_2 | 0.5253 |
2 | time_3 | 3.697 |
3 | time_1 | 1.084 |
3 | time_2 | 1.066 |
The problem we have now is that `name` contains text describing the time at which the observation was made. We probably want to store a number for each time-point, so we can make a plot with time on the x axis.
The `separate` command separates a single character column (`name`) into multiple columns. Rather than have a column with labels of the form 'time_1', it can create two columns, containing 'time' and '1' respectively.
# convert to long form; extract the `time` as a new numeric column
longrepeatmeasuresdata <- repeatmeasuresdata %>%
pivot_longer(starts_with("time")) %>%
separate(name, into = c("variable", "time"))
longrepeatmeasuresdata %>% head %>%
pander()
person | variable | time | value |
---|---|---|---|
1 | time | 1 | -1.207 |
1 | time | 2 | 1.415 |
1 | time | 3 | 3.485 |
2 | time | 1 | 0.2774 |
2 | time | 2 | 0.5253 |
2 | time | 3 | 3.697 |
Now the data are in long format, we can plot the points over time:
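The plotting code isn't shown; a minimal sketch of one way to do it (`separate` leaves `time` as text, so we convert it to a number for the x axis):

longrepeatmeasuresdata %>%
  ggplot(aes(x = as.numeric(time), y = value)) +
  geom_point()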
How does R know where to split the text?
In the example above, `separate` split data like "time_1", "time_2" etc. into two columns: `variable` and `time`.
Q: How did it know to use the underscore (`_`) to split the data?
A: The default is to split on anything which is not a letter or number, so `_`, a space, or `,` would all work.
Sometimes, though, we need to tell R explicitly which character to use to separate the values.
If we had a column of email addresses we could split ben.whalley@plymouth.ac.uk into the username (e.g. ben.whalley) and domain name (plymouth.ac.uk) using the `@` symbol.
To do this we just write `sep="@"` when we use `separate`.
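A minimal sketch of that idea (the one-row tibble here is purely illustrative):

# hypothetical example: split an email address at the @ symbol
tibble(email = "ben.whalley@plymouth.ac.uk") %>%
  separate(email, into = c("username", "domain"), sep = "@")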
The `messy_exp` dataset in `psydata` contains simulated RT data from 100 participants in 2 conditions (A and B) at three time points (1, 2, and 3).
- Use the `separate()` function to split up the `condition` variable in this dataset and draw the following plot:
- This file contains sample contact and address data for 100 people: https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv
- Read the data into R (you can either use the URL above directly inside the `read_csv()` function, or download and then re-upload the data to the server)
- Use the `separate` function to make a new variable which contains the domain name of these contacts' email addresses (e.g. yahoo.com, hotmail.com). Note, you will need to use `sep="@"` to split the email addresses at the `@` symbol
- Use the `distinct` and/or `count` functions on the new variable you create containing the domain name. Look them up in the help file if you don't know which to use to answer these questions:
- How many people had a Gmail account?
- Which domains had more than 10 users?
# read the data directly from the URL
contacts <- read_csv('https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv') %>%
separate(email, into=c("user", "domain"), sep ="@") # uses the @ symbol as a separator
n |
---|
39 |
domain | n |
---|---|
gmail.com | 16 |
domain | n |
---|---|
aol.com | 13 |
gmail.com | 16 |
hotmail.com | 13 |
yahoo.com | 13 |
Questionnaire data
The file `sweets.csv` contains a small number of example rows of data exported from an online survey. The file is at: https://is.gd/axenun
We can look at the first few rows of the data using the `glimpse` command:
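A sketch of the commands that produce the output below (the exercise afterwards asks you to run them yourself):

library(tidyverse)

sweets <- read_csv('https://is.gd/axenun')
sweets %>% glimpse()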
Rows: 4
Columns: 8
$ ID <dbl> 6, 7, 8, 9
$ `Start time` <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12…
$ `Completion time` <dttm> 2019-05-24 09:12:35, 2019-05-24 09:1…
$ Email <chr> "anonymous", "anonymous", "anonymous…
$ Name <lgl> NA, NA, NA, NA
$ `How much do you like sweets?` <chr> "I don't like them", "I'm neutral", "…
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like th…
$ Gender <chr> "M", "F", "M", "F"
Import the sweets data as shown above from: https://is.gd/axenun
Save it to a new variable called `sweets`.
Tidying questionnaires
When we look at the imported data it’s useful to note:
- There are extra columns we don't need (at least for now).
- Some of our variable names are very long and annoying to type (for example `How much do you like sweets?` is the name of one of our columns).
- Our responses are in text format, rather than as numbers. For example, the data say "I don't like them" or "I'm neutral" rather than numbers from a 1-5 scale.
We need to sort each of these problems to make things more manageable for our analysis.
Selecting and renaming
Remember, R makes it awkward to use columns whose names contain spaces or other special characters. We want to avoid this where possible.
Selecting
To use columns with spaces in their names, we must 'escape' the spaces and let R know they are part of the name rather than a gap between two different names.
This video shows how (or read below):
To escape spaces and use columns with long names we use the backtick character (the backwards facing apostrophe) to wrap the column name.
In general, if your columns contain spaces or other odd characters like hyphens or question marks then you will need to wrap them in backticks.
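For example, a minimal sketch using the `sweets` data from above:

# wrap column names containing spaces in backticks
sweets %>%
  select(`How much do you like sweets?`, `How much do you like chocolate`)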
Renaming
Some of the imported variable names in the `sweets` data are long and awkward to use.
Most researchers would rename these variables to make them more usable in R code.
You can rename variables like this:
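A sketch of the general pattern (this toy example is hypothetical; note that the new name goes on the left of the `=`):

# hypothetical example: rename a column whose name contains a space
tibble(`old name` = 1:3) %>%
  rename(new_name = `old name`)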
So for this example:
sweets %>%
rename(
like_sweets = `How much do you like sweets?`,
like_chocolate = `How much do you like chocolate`,
)
# A tibble: 4 × 8
ID `Start time` `Completion time` Email Name like_sweets
<dbl> <dttm> <dttm> <chr> <lgl> <chr>
1 6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I don't like th…
2 7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral
3 8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I like them
4 9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA I'm neutral
# ℹ 2 more variables: like_chocolate <chr>, Gender <chr>
Explanation of the code: We used `rename` to change the names of our variables. We needed to wrap the long names of the questions in 'backtick' symbols to make sure R understood each was a single column name.
You should create a new variable to save the renamed dataset (with a descriptive name for use later on):
# create a new variable containing the renamed dataset
sweets.renamed <- sweets %>%
rename(
like_sweets = `How much do you like sweets?`,
)
Create a copy of the sweets data in which you have selected only the two columns with long names.
Create a second copy of the data where you have renamed the columns with long names to something short, and without spaces.
Renaming with the janitor package
A good alternative to renaming variables manually is to use the `clean_names` function from the `janitor` package.
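The command that produces the output below isn't shown; based on the explanation further down, a sketch would be:

sweets %>%
  janitor::clean_names() %>%   # tidy up all of the column names
  glimpse()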
Rows: 4
Columns: 8
$ id <dbl> 6, 7, 8, 9
$ start_time <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:3…
$ completion_time <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:…
$ email <chr> "anonymous", "anonymous", "anonymous",…
$ name <lgl> NA, NA, NA, NA
$ how_much_do_you_like_sweets <chr> "I don't like them", "I'm neutral", "I …
$ how_much_do_you_like_chocolate <chr> "I don't like them", "I don't like them…
$ gender <chr> "M", "F", "M", "F"
Explanation of the code and result: I used the `clean_names` function from the `janitor` package without using `library`. I did this by typing `janitor::` and then the name of the function. In the result, `clean_names` has made a new dataset in which it has:
- Removed all special characters
- Made everything lower case (easier for R to autocomplete)
- Replaced spaces with underscores
- Made column names unique (this isn’t always the case with imported data, but is important for R)
I typically use this function when importing any new data because it makes the naming and access of columns much more consistent and easier to remember.
Recoding text
We noticed above that our responses were stored as text labels like "I don't like them" rather than on a numeric scale. This makes them hard to use in an analysis.
We need to recode the text variables into numeric versions.
How to do it
First we must tell R what number we want to use for each text label. That is, we create a mapping of numbers to labels.
This takes a few steps:
- Check exactly what the text values are which need to be mapped.
- Make a mapping variable which assigns each text value a number
- Use the `recode` function with `mutate` to create a new, numeric column
This video walks you through the steps below:
Step 1: Check EXACTLY what text labels we have
To check which labels we need to recode, I select the column in question and use the `unique()` function.
# check exactly what text values are in the dataset?
sweets %>%
select(`How much do you like sweets?`) %>%
unique()
# A tibble: 3 × 1
`How much do you like sweets?`
<chr>
1 I don't like them
2 I'm neutral
3 I like them
Step 2: Make a mapping variable
We do this by creating what R calls a named vector, which is a special kind of list.
To make a named vector we use the `c()` function. The letter `c` here just stands for 'combine', i.e. 'combine these things into a list'.
This is a simple example:
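A minimal sketch of such a mapping (this particular example is assumed, matching the description that follows):

# map the text labels "No" and "Yes" to the numbers 0 and 1
yes_no.map <- c("No" = 0, "Yes" = 1)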
We could then use this mapping to recode a column of data which contained the words "No" or "Yes".
A useful trick when creating your own mappings is to use R to do the formatting for you (see the video above for a demo).
Re-using the code from the previous step, we use `unique()` to show us the unique values for the question about sweets. We then pipe the result to the `paste()` and `cat()` functions, like this:
# the hack we use as a short-cut to creating a mapping variable
sweets %>%
select(`How much do you like sweets?`) %>%
unique() %>%
paste() %>% cat()
c("I don't like them", "I'm neutral", "I like them")
Explanation of the output: Using `paste` and `cat` is a bit of a hack. When we run this code we see the output `c("I don't like them", "I'm neutral", "I like them")`. This is a list of the values in the `sweets` data for this question, formatted in a way that will be useful to us in the next step.
We then copy and paste this output into a NEW code block, and EDIT it to assign our mappings:
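The edited code block isn't shown; based on the explanation below and the recoded values that appear later (-1, 0 and 1), a sketch would be:

# assign a number to each text label
preference.mappings <- c(
  "I don't like them" = -1,
  "I'm neutral" = 0,
  "I like them" = 1
)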
Explanation of the code: We used the previous output to create a mapping. By adding the parts which read `= -1` and `= 0` etc., we have told R what value we want to assign to each label.
Q: How do you know what number values to assign?
A: It doesn’t matter, provided:
- The intervals between the options are the same, and
- Each text value has a different number
So, if we had a Likert scale ranging from "Completely agree" to "Completely disagree" in 7 increments, we could score this from 0 to 6, or 1 to 7, or -3 to 3. These would all be fine.
Step 3: Use the mapping variable to recode the column
We can use our new mapping with the `mutate` and `recode` functions to make a new column containing numbers rather than text:
sweets.recoded <- sweets %>%
rename(
like_sweets = `How much do you like sweets?`,
like_chocolate = `How much do you like chocolate`,
) %>%
# use recode to convert text response using preference.mappings
mutate(
like_sweets_numeric =
recode(like_sweets, !!!preference.mappings)
)
We can see this new column if we use `glimpse`:
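A sketch of that call:

sweets.recoded %>% glimpse()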
Rows: 4
Columns: 9
$ ID <dbl> 6, 7, 8, 9
$ `Start time` <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-…
$ `Completion time` <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05…
$ Email <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name <lgl> NA, NA, NA, NA
$ like_sweets <chr> "I don't like them", "I'm neutral", "I like them"…
$ like_chocolate <chr> "I don't like them", "I don't like them", "I'm neu…
$ Gender <chr> "M", "F", "M", "F"
$ like_sweets_numeric <dbl> -1, 0, 1, 0
Explanation of the code:
- The start of the first line is `sweets.recoded <-`, which means 'make a new variable called `sweets.recoded`'.
- Then we use `mutate` to create a new column called `like_sweets_numeric`.
- We make this column using `recode` on the question about liking sweets.
- We use the `preference.mappings` mapping to specify what numeric score to give each of the text values.
Watch out for the exclamation marks, `!!!`: in the code above there are three exclamation marks before the mapping; make sure you do the same.
Three exclamation marks: !!!
Understanding this isn’t necessary to get on with the course. Only read this if you are interested!
In the code above, when we used `recode` we put three exclamation marks just before our list.
We defined the mapping:
likert.responses <- c(
"I hate them" = 1,
"I don't like them" = 2,
"I'm neutral" = 3,
"I like them" = 4,
"I can't live without them" = 5)
And then used it with `recode`, with the three exclamation marks:
liking_of_sweets_data %>%
mutate(like_sweets_numeric = recode(like_sweets_text, !!!likert.responses)) %>%
pander()
The reason for this is that `recode` actually expects us to specify the mapping like this:
liking_of_sweets_data %>%
mutate(like_sweets_numeric = recode(like_sweets_text,
"I hate them" = 1,
"I don't like them" = 2 ...))
But this means we have to repeat the mapping for each of the questions. Because all the questions use the same mapping this gets repetitive, and can lead to errors.
The three exclamation marks, `!!!`, unpack the list for us. So writing `!!!likert.responses` saves us the bother of writing the mapping out in full each time.
Summary/video explanation
This is one of the trickiest bits of R code we use. I’ve included an annotated video of just these steps as a reference below:
Use this three-step process to create a recoded version of the `like_chocolate` variable.
Remember to watch the video at the start of this section, or the short version in the green box above, if anything is unclear.
Combining scores
Often questionnaires are designed to make repeated measurements of the same phenomena, which can then be summed or averaged to create a more reliable measure.
We've already seen how `mutate()` creates a new column. We can use this again to create the sum of responses to both questions:
sweets.recoded %>%
# mutate to create column containing sum of both recoded questions
mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
# select the new variable we created
select(ID, liking) %>%
pander()
ID | liking |
---|---|
6 | -2 |
7 | -1 |
8 | 1 |
9 | 0 |
Explanation of the code:
- We added together both the questions which asked about 'liking'. This created a new column containing the combined score, called `liking`.
- In the third line we selected only the `ID` column, plus the new column we made.
Using ChatGPT to automate things
In the sections above I explained how to tidy up a simple dataset. However it turns out many of these tasks are easily solved using ChatGPT.
You now have a choice you didn’t have a few years ago:
- Learn how to do these steps by hand (and sometimes use ChatGPT to automate the work)
- Learn just enough to be able to get ChatGPT to write code for you (and hope it’s correct)
I would recommend option 1 (at least until LLMs make improvements in code generation) because it will likely be faster, more accurate and will promote your understanding.
But as a guide, here’s how I might solve this problem with ChatGPT as an assistant.
Step 1. Show ChatGPT the data
I use these commands to output some information about the data, in a format that's easy to copy into ChatGPT:
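A sketch of what those commands might be (based on the notes below, which mention pasting in `glimpse` and `count` output):

# show column names, types and a sample of the values
sweets %>% glimpse()

# list the exact values in the columns we want to recode
sweets %>% count(`How much do you like sweets?`)
sweets %>% count(`How much do you like chocolate`)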
Step 2: Ask GPT to recode it for us
This link shows how I asked ChatGPT to recode the `sweets` data: https://chat.openai.com/share/325b1f77-c68c-417a-8e75-76b05a6b9633
Note that I:
- told it the names of the columns
- showed it a sample of the data (pasted in the output of `glimpse`)
- explicitly listed the values in the columns I wanted to recode (pasted in the output of `count`)
- directly asked for R code
- gave instructions to use the tidyverse (this makes the output more consistent with what you have learned)
Step 3: Debug and validate the R code GPT provides
In this example, ChatGPT provided the following code:
library(tidyverse)
# Load the sweets dataset (assuming it's already loaded)
sweets <- read_csv('https://is.gd/axenun')
# Recode the text responses into numeric variables
sweets <- sweets %>%
mutate(Like_sweets = recode(`How much do you like sweets?`,
"I don't like them" = 0,
"I like them" = 1,
"I'm neutral" = 2
)) %>%
mutate(Like_chocolate = recode(`How much do you like chocolate`,
"I don't like them" = 0,
"I like them" = 1,
"I'm neutral" = 2
)) %>%
select(-`How much do you like sweets?`, -`How much do you like chocolate`)
# Rename variables with spaces to use underscores and be shorter
sweets <- sweets %>%
rename(
ID = `ID`,
Start_time = `Start time`,
Gender = `Gender`
)
# Remove timestamp and email columns
sweets <- sweets %>%
select(-`Start_time`, -`Email`)
# Check the modified dataset
glimpse(sweets)
Amazingly, this actually works!
However it’s worth saying that it often doesn’t work on first try, and it is dependent on writing a good prompt to the model, and giving it plenty of information about the data and what you want to achieve.
Some tips:
- Think first. Be clear about what you want to achieve.
- Explain in detail what the data are like and show examples.
- Give explicit instructions.
- If code doesn’t work, ask for explanations or corrections.
- Break tasks down into smaller steps (ask for less and build up your script).
- Paste in the code you already have and ask for a completion with one extra step (e.g. something you don’t know how to do).
- If you can afford it, use ChatGPT 4 or try new R-specific LLMs as they are (inevitably) released.
Where there are bugs in ChatGPT’s code, you can
- fix these bugs by hand, or
- run the code, report the errors to ChatGPT and ask it to fix them for you.
The second option has some risks (ChatGPT’s code might delete your data, although it’s unlikely), and is probably slower. In this instance it took me quite a while to get ChatGPT to fix the simple errors, and this required understanding the code to some degree anyway.
As I recommend in the lectures, knowing how to use R well enough to fix the small errors introduced is likely to be more productive in the long run. Think of LLMs as your assistant rather than your replacement.
Consolidation activity
Use this example dataset: https://is.gd/ayinof
- Read in the data.
- Rename the long column names to something short, and without spaces
- Recode at least three of the columns with data about sleep quality to be numeric
- Save the result of this work in a new variable called `sleep.tidy`
- Pivot the recoded variables and make a boxplot of responses to these questions
- Create a summary score called `sleep_quality` which is the sum of these recoded questions (use mutate)
- Create a density plot of this summary score and interpret what you see (describe the pattern in plain English)
sleep <- read_csv('https://is.gd/ayinof')
# used to check what response values are in each question
sleep %>%
select(`My sleep is affected by my study commitments`) %>%
unique() %>% paste %>% cat
c("Agree", "Somewhat agree", "Somewhat disagree", "Disagree", "Neither agree nor disagree", "Strongly agree", "Strongly disagree")
sleep %>%
select(`My electronic device usage negatively affects my sleep`) %>%
unique() %>% paste %>% cat
c("Disagree", "Strongly agree", "Somewhat disagree", "Agree", "Somewhat agree", "Neither agree nor disagree", "Strongly disagree")
# we will use the same mapping for both questions because they have the same responses
sleep.map <- c("Agree"=2,
"Somewhat agree"=1,
"Somewhat disagree"=-1,
"Disagree"=-2,
"Neither agree nor disagree"=0,
"Strongly agree"=3,
"Strongly disagree"=-3)
sleep.tidy <- sleep %>%
# now we recode the two text variables (we only need use mutate once though)
mutate(
sleep_study = recode(`My sleep is affected by my study commitments`, !!!sleep.map),
sleep_electronic = recode(`My electronic device usage negatively affects my sleep`, !!!sleep.map)
)
# now we can pivot longer to make a plot
sleep.tidy %>%
pivot_longer(c(sleep_study, sleep_electronic)) %>%
ggplot(aes(name, value)) + geom_boxplot()
And make a summary score combining both questions
sleep.tidy.withsumary <- sleep.tidy %>%
# and create the summary score
mutate(sleep_quality = sleep_study + sleep_electronic )
# check the result. it looks ok
sleep.tidy.withsumary %>% glimpse
Rows: 241
Columns: 13
$ uniqueid <chr> …
$ `Start time` <dttm> …
$ `Completion time` <dttm> …
$ `My sleep is affected by my study commitments` <chr> …
$ `I achieve good quality sleep` <chr> …
$ `My electronic device usage negatively affects my sleep` <chr> …
$ `Tiredness interferes with my concentration` <chr> …
$ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> …
$ `I often achieve eight hours of sleep` <chr> …
$ `I regularly stay up past 11pm` <chr> …
$ sleep_study <dbl> …
$ sleep_electronic <dbl> …
$ sleep_quality <dbl> …
Check your knowledge
- What does it mean for data to be "tidy"? Identify the three key characteristics.
- What function creates a new column in a dataset?
- Which function allows you to choose a subset of the columns in a dataset?
- What does `pivot_longer` do?
- Why is `pivot_longer` useful when we want to make a faceted plot?
- Why is wide data sometimes more useful than long data?
- If you have RT data with 10 groups, what tidyverse 'verbs' (functions) would you use to calculate the mean for each group?
- How can you read data into a dataframe from over the internet?
- Does `recode` convert from text to numeric, or from numeric to text values?
- Why is it important to recode text variables into numeric values when working with survey or questionnaire data?
- Why is it important to copy and paste exact values when making a mapping variable for `recode`?
- When renaming, is it `rename(oldvarname=newvarname)` or `rename(newvarname=oldvarname)`?
- What does the `clean_names` function from the janitor package do?
- Imagine a single cell of your dataset contains the string "conditionA_time_1". What function should we apply to it?