Working with groups

Ben Whalley, Paul Sharpe, Sonja Heintz

Overview

As researchers, we often collect data from two or more groups.

Group allocations are categorical variables. They are stored in a special way by R, which makes it easier to display them on graphs and use them in analyses.

In this workshop we learn to how to use categorical variables in boxplots, to add colour to scatterplots, and to make tables of descriptive statistics which you will need in your coursework and project reports.

Making boxplots

  • Boxplots are useful for comparing categorised data
  • Use aes() with ggplot() to choose a categorical column as the x-axis
  • geom_boxplot() draws boxes and whiskers for each category
  • The box describes the interquartile range
  • The midpoint is the median
  • Individual points show outliers in the data
iris %>%
  ggplot(aes(x=Species, y=Petal.Length)) +  # specify dataset columns 
  geom_boxplot()                            # add the boxes

Boxplots are useful for visualising differences between categories or groups.

Using ggplot(), making a boxplot is similar to a scatterplot.

If we haven’t already done it, we’d need to load the tidyverse:

library(tidyverse)

To make the plot, we first choose the data we want to plot and add a pipe to the end of the line to send this data to the next command:

  • From the code below select only first line
  • Also show selecting/executing just the name of the dataset and how this displays the data
  • Emphaise the importance of breaking code down and running sub-steps separately

Next we use ggplot() and aes(x = ..., y = ...) to select the columns we want to use in the data. aes is short for aesthetics which means ‘something you can see’. So the aes() function defines what will be able to see on the plot — the axes, and other features like colour.

This time we have chosen Species for the x-axis, because it is a categorical variable.

iris %>%
  ggplot(aes(x = Species, y = Petal.Length))

  • Demonstrate running code above
  • Also show using autocomplete in RStudio when writing the code above

You’ll notice that I used the RStudio autocomplete feature to write function and column names from our dataset. This makes things much easier — especially for column names. If you type column names by hand it’s easy to make spelling errors or typos, and this leads to errors in R.

If we run the code so far, ggplot() draws the plot axes, but doesn’t add the data yet:

Run the code so far

So far we have said what we want to show on our plot, but not how it should be shown.

To finish, we add geom_boxplot() which actually draws the boxes:

iris %>%
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

Note that I used a + symbol to add the boxplot layer to our graph.

So, now we have a boxplot, with one box drawn for each value of Species.

Interpreting boxplots

In a boxplot:

  • The thick line across the middle of the box is the median or midpoint of the data

  • The height of the box indicates the interquartile range (IQR). This means the box contains 50% of the datapoints. A wider IQR indicates greater variation (spread) in a dataset.

  • The meaning of the “whiskers” (the lines above the top, and below the bottom of the box) varies a little bit depending on the software you use — but they always show the range/spread of the data. The default in ggplot() are whiskers with lengths covering data points no more, or less than 1.5 times the IQR.

  • Any data point outside the range of the whiskers is described as an ‘outlying point’ or ‘outlier’. Each outlier is plotted individually as a dot.

Exercise 1

  1. Open session-3.rmd using the Files pane. This is the workbook you will be using in this session.
  2. Use the iris dataset to create a boxplot with Species on the x-axis and Sepal.Width on the y-axis (sepals are the leaves that encase an iris flower).

Your boxplot should look like this:

Adding colour to a plot

  • The points of a scatterplot can be coloured based on a third column of data
  • We can colour points using categorical or continuous data
  • The aes(..., colour=column_name) option selects the column to use (change column_name to the name of your column)
# scatterplot of engine size (displ) against highway fuel efficiency (hwy)
mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point()

# colour each point; different colour for each type of transmission
# front, rear and four wheel drive
mpg %>%
  ggplot(aes(displ, hwy, colour=drv)) +
  geom_point()

As we saw before, scatterplots show the relationship between two variables.

Using the mpg data, we could plot the size of car engines (displ, short for displacement) against their fuel efficiency on the highway in miles per gallon (hwy):

mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point()
  • Run code to generate plot

NOTE: This first plot is not shown here to save space, but is shown in the video.

We can add colour to distinguish the points in this plot — for example, what kind of car each point represents.

So far, you’ve used the aes() function to define which variable is plotted on the x and y axes. The aes() function is short for ‘aesthetics’ (what we can see) and connects columns in our data to visual aspects of a plot, like the x and y axes.

The colour=... option adds to this, and says which column is used to colour the points.

In the mpg data, the drv column contains one of three values. The value categorises cars based on which wheels drive the engine. An f in drv means a front wheel drive car, r means rear wheel drive, and 4 means four wheel drive.

We can write colour=drv to tell R to colour each point, depending on which wheels are driven:

  1. Add colour=drv
  2. Run code to generate plot
# colour each point; different colour for each type of transmission
# front, rear and four wheel drive
mpg %>%
  ggplot(aes(displ, hwy, colour=drv)) +
  geom_point()

Notice that ggplot() automatically adds a key to the plot, mapping the colour of the points to the categories in drv.

This is an example of using a categorical variable to colour our points.

Exercise 2

  1. Use the development data from the psydata package (run library(psydata) first if you haven’t already).
  2. Create a scatterplot with life_expectancy on the x-axis (along the bottom) and gdp_per_capita on the y-axis.
  3. Add colour to this scatterplot: make each continent a different colour.
  4. Run the chunk of code.

Your plot should look like this:

Using different types of data and visual scales

  • Continuous, categorical and text data are all common in psychology
  • Internally, R stores columns of data using different data types.
  • R uses data types as a clue to set defaults (e.g. for graphs)
  • Sometimes we need to convert between types
  • For example, sometimes categorical data is (wrongly) stored as a numeric column
# always load the libraries we need first
library(tidyverse)
library(psydata)

# see a list of columns in the dataset, and their types
iris %>% glimpse
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, …
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, …
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, …
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
# make a scatter plot with two continous axes
# both wt and mpg are numeric variables
# so this works well
fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()   

# try and make a boxplot with `am` as the x-axis
# but because `cyl` is stored as a numeric variable
# (not categorical) the scale of the x-axis is wrong
fuel %>%
  ggplot(aes(cyl, mpg)) +
  geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

# re-draw the plot, but converting `cyl` to a factor/categorical
# variable first. Now the x-axis looks correct
fuel %>%
  ggplot(aes(factor(cyl), mpg)) +
  geom_boxplot()

Ok this is a slightly longer video. In the first bit we introduce some background to the term variable, and the way in which R stores different types of data. In the final part we see why this is important when we’re making plots (and running other analyses).

I’m especially going to talk about:

  • the different types of variable we may have in our study (e.g. continuous, categorical, text)
  • the way R stores these different data (e.g. as numeric, factor or character data) and
  • the way that ggplot() presents these different data types

These might seem like small details — but having some understanding of what is happening will help later on: you will be able to control how your data are presented in tables and plots. It will also help you make sure your data are in the right format for other statistical tests.

Columns vs variables

This is really important. Make sure you are clear on the different meanings of the word ‘variable’

The word variable gets used in at least 4 different ways in quantitative research, and this confuses a lot of people.

We can’t avoid all ambiguity because these different usages are common in the field, but it helps to know about them in advance. The main usages are:

  • variables in a theoretical model. That is, things or constructs we think exist, and which cause other things to happen. An example might be “empathy”, or “working memory”. These are psychological constructs which many theories include, and those theories say these constructs play a causal role in how we behave or perform on certain tasks.

  • variables in a study design. An example here might an experimental group allocation, or an attribute of participants like their age or gender.

  • variables in a dataset. Sometimes people want to refer to a column of numbers in a dataset (e.g. a spreadsheet), where the column has a name, and use the term variable again. This usage overlaps with the previous two, but it doesn’t have to. For example, we might have a column of numbers recording the date and time participants’ completed an empathy questionnaire. Analysed a certain way these timestamps might tell us something useful, but the column of timestamps isn’t a variable in our theoretical model, and they might not be used as a variable in our experiment.

[ Finally, we have ]

  • variables in R and computer programs: This type of variable is a general purpose container which can store anything. Variables in R can contain columns of numbers, but they can also contain whole datasets, or graphs, or the results of statistical tests. A variable in R often isn’t the same thing as an experimental variable or a column in a dataset.

In this short course use variable to mean either an experimental/theoretical variable or an R container.

When we mean a column of data in a dataset we will use the word column instead.

Types of variable

OK so let’s focus for a minute on variables in a study design.

There are a few main types of these variable to consider. You might be familiar with some of these terms already:

  • categorical or nominal variables (sometimes also called factors): e.g. gender, education status

  • binary variables (a special category that is either TRUE or FALSE): e.g. smoker/non-smoker

  • ordinal variables (categorical responses where the order is important): e.g. a 1-7 response on a Likert-style question

  • interval or continuous variables: e.g. weight in kg, time in milliseconds

  • count variables (a frequency, measured in whole numbers): e.g. the number of social contacts per week

  • text variables: e.g. free text responses to open-ended questionnaire items

Types of column

OK, so those are the types of variable.

We can also think about the types of column that our dataset stores.

Datasets have multiple columns, each with a unique name.

In R, every column has a data type. These data types are normally determined by the type of variable. But there is some overlap, and different data types could be used to store the same variable.

We sometimes need to convert between data types.

There are four data types you need to know about in R: factors, text, numeric and logical columns.

  • Factors, which are used for categorical data (and sometimes ordinal data too)
  • Text, also called character or string data.
  • Numeric columns can called be either integer or double. Double is computer speak for ‘double-precision number’, which just means decimals and really big numbers are allowed
  • Logical, which can be used to store values that can only be either True or False.

The important thing to know is that— although different types of variable are usually stored in different types of column in R— this isn’t always the case.

Sometimes data can be stored as the wrong type—e.g. categorical data could end up stored as numeric data — and in these cases we need to convert from one format to another.

You can see which data type is used to store a variable when using the glimpse() command introduced in Session 1 (see here).

iris %>% glimpse()
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, …
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, …
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, …
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, …

In the output you can see the variable names listed on the left, followed by text in angle brackets, e.g.: <dbl> which is the abbreviated name of data type.

In this built-in dataset, most of the data is numeric (dbl), but the Species variable is categorical, and stored as a factor (fct).

It’s possible to convert columns from one type to another, to suit our needs. We’ll see more of this later.

Data/column type Useful for Abbreviations used/subtypes Often need to convert from
Factor categorical, ordinal fct, ord text
Text text, categorical chr categorical
Numeric continuous, ordinal int, dbl categorical; text
Logical binary (boolean) lgl numeric

Column types determine the scales on graphs

If we look at the fuel data (from psydata) we can see that most of the columns are stored as numeric data (dbl):

fuel %>% glimpse()
Rows: 32
Columns: 7
$ mpg         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.…
$ cyl         <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, …
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 452…
$ power       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230,…
$ weight      <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 171…
$ gear        <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, …
$ automatic   <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

This is fine if we want to make a scatter plot.

In the example below we have ‘miles per gallon’ vs the weight of cars in the fuel dataset (in psydata):

fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()       

In this example of a scatterplot both the x and y axes are continuous. That is, we expect them to be real numbers, and they are stored as numeric columns.

However, if we want to make a boxplot of the mpg column using cyl (number of cylinders) as the x axis then we have a problem:

fuel %>%
  ggplot(aes(cyl, mpg)) +
  geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

The warning message says Continuous x aesthetic -- did you forget aes(group=...)?. This is a clue as to what has gone wrong.

We might have expected to see;

  • miles per gallon on the \(y\) axis
  • three separate boxes, one each for 4, 6 and 8 cylinders

It didn’t work as expected though.

R spots that cyl is stored as numeric data, and because it doesn’t know better it creates a _continuous scale) on the \(x\) axis.

Only one box is drawn at the midpoint of all the values of cyl; because cyl ranges from 4 to 6 the box appears at 6.

Although R often has good defaults, it isn’t smart — if it doesn’t behave as we expect we need to be more specific.

Here we want to use cyl as a categorical variable, so we need to give R precise instructions. We should convert cyl to a factor explictly. Then our plot will work properly.

We can use the command factor(cyl) to tell R that the \(x\)-axis is a factor:

fuel %>%
  ggplot(aes(factor(cyl), mpg)) +
  geom_boxplot()

This gives us the boxplot we were expecting.

The only change was to replace cyl with factor(cyl). This tells R to convert the variable cyl to a factor, and ggplot() can then set the scale of the x-axis correctly.

Exercise 3

Work with a friend: Describe the 4 ways in which quantitative researchers might use the word ‘variable’?

(If you need to look these up from the video or text above then try testing yourself again after completing other exercises).

Exercise 4

Use glimpse() to check the data types of the mpg and the diamonds datasets.

  • The data type of the 4th variable in the mpg data is
  • The data type of the 5th variable in the diamonds data is

Exercise 5

Use the fuel dataset to make a boxplot showing miles per gallon on the y-axis, and number of gears (gear) on the x-axis. Your plot should look like this:

Remember that you will need to change the type of the gear variable for this to work.

Using group_by() to make a summary table and compare groups

  • Datasets often contain categorical variables
  • We often want to compare statistics (like averages) between categories
  • The group_by() function is a quick way to combine filtering and summarising
  • group_by() creates a grouped data.frame
  • Adding group_by() to a pipeline runs the subsequent steps once for each group.
  • The result is always a new data.frame
# check the columns in the funimagery dataset
# results of an RCT of functional imagery training for weight loss
# conducted at the University of Plymouth
funimagery %>% glimpse
Rows: 112
Columns: 8
$ gender              <chr> "f", "f", "f", "f", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "m"…
$ age                 <int> 44, 32, 33, 21, 27, 56, 50, 57, 34, 25, 70, 56, 55, 43, 37, 45, 60, 51, 2…
$ kg1                 <dbl> 107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0, 87.0, 121.2, 84.0, 100.…
$ kg2                 <dbl> 106.7, 105.4, 101.0, 79.0, 80.0, 57.0, 92.0, 87.0, 87.0, 123.0, 81.9, 97.…
$ kg3                 <dbl> 106.0, 105.9, 98.8, 78.0, 80.0, 60.0, 92.0, 87.0, 86.4, 119.7, 84.0, 98.0…
$ person              <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
$ intervention        <fct> MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, M…
$ weight_lost_end_trt <dbl> -1.1, -1.6, 1.5, -1.0, -1.0, -2.0, -3.0, -3.0, 0.0, 1.8, -2.1, -2.8, -2.4…
# boxplot to compare the intervention groups
funimagery %>%
  ggplot(aes(intervention, weight_lost_end_trt)) +
  geom_boxplot()

# calculate mean weight loss in each group in a laborious way
funimagery %>%
  filter(intervention=="MI") %>%
  summarise(mean(weight_lost_end_trt))

funimagery %>%
  filter(intervention=="FIT") %>%
  summarise(mean(weight_lost_end_trt))


# use group_by to split the data and summarise each group
funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt))

# store the result (also a data.frame) in a new variable
average_weight_loss <- funimagery %>%
                          group_by(intervention) %>%
                          summarise(mean(weight_lost_end_trt))


# example of grouping by two columns at once
funimagery %>%
  group_by(gender, intervention) %>%
  summarise(mean(weight_lost_end_trt))



# calculate the mean and SD in one go
funimagery %>%
  group_by(intervention) %>%
  summarise(
    mean(weight_lost_end_trt),
    sd(weight_lost_end_trt)
  )


# how to give your new summary columns a name
# (this is good practice)
funimagery %>%
  group_by(intervention) %>%
  summarise(
    mean_weight_lost_end_trt = mean(weight_lost_end_trt),
    sd_weight_lost_end_trt = sd(weight_lost_end_trt)
  )

In this video we’ll use the funimagery dataset, which is part of the psydata package. This contains data from a randomised controlled trial run in Plymouth (Solbrig et al., 2019).

The study compared two treatments for weight loss: Functional Imagery Training (FIT) and Motivational Interviewing (MI).

We can see the columns in the dataset using glimpse():

  • Run code
funimagery %>% glimpse
Rows: 112
Columns: 8
$ gender              <chr> "f", "f", "f", "f", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "m"…
$ age                 <int> 44, 32, 33, 21, 27, 56, 50, 57, 34, 25, 70, 56, 55, 43, 37, 45, 60, 51, 2…
$ kg1                 <dbl> 107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0, 87.0, 121.2, 84.0, 100.…
$ kg2                 <dbl> 106.7, 105.4, 101.0, 79.0, 80.0, 57.0, 92.0, 87.0, 87.0, 123.0, 81.9, 97.…
$ kg3                 <dbl> 106.0, 105.9, 98.8, 78.0, 80.0, 60.0, 92.0, 87.0, 86.4, 119.7, 84.0, 98.0…
$ person              <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
$ intervention        <fct> MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, M…
$ weight_lost_end_trt <dbl> -1.1, -1.6, 1.5, -1.0, -1.0, -2.0, -3.0, -3.0, 0.0, 1.8, -2.1, -2.8, -2.4…

Weight was measured three times, and these measurements are in the columns called kg1, kg2 and kg3.

  • The first measurment (kg1) was taken at baseline, when participants joined the study
  • The second measurement (kg2) was taken at the end of treatment, which was 6 months after baseline
  • The final measurement (kg3) was made 12 months after baseline

The weight_lost_end_trt column is the difference between kg1 and kg2 – i.e. how much weight the participant lost after their treatment.

We have already seen a that a boxplot can compare scores between categories:

  • Generate plot
funimagery %>%
  ggplot(aes(intervention, weight_lost_end_trt)) +
  geom_boxplot()

This plot is helpful, and makes the differences really clear. But what if we want the actual numbers in a table, or to report in the text of our paper?

Using filter() and summarise()

One option would be to filter our data first and then summarise each subset in turn:

  • Run code
funimagery %>%
  filter(intervention=="MI") %>%
  summarise(mean(weight_lost_end_trt))

This output shows us the mean for the MI group. We then have to repeat that code for the FIT group:

  • Run code
funimagery %>%
  filter(intervention=="FIT") %>%
  summarise(mean(weight_lost_end_trt))

This is a bit repetitive, and would worse if there were lots of categories. Imagine doing something similar for each continent in the development data, for example.

Using group_by()

Instead of using filter(), we can use group_by() to split our dataset into multiple groups, summarising each one separately:

  • Run code
funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt))

Using group_by() before summarise() splits up the data, in thise case depending on which intervention the participant received.

  • Highlight group_by()

The summarise() function calculates the mean for each group separately.

  • Highlight summarise()

The result is a new data.frame with two columns: The name of the intervention, and the average weight lost. We could save this in a new variable for use later, if we liked:

average_weight_losses <- funimagery %>%
                          group_by(intervention) %>%
                          summarise(mean(weight_lost_end_trt))

You can see the new variable in the Environment, so it’s stored for later.

Nested groups

Sometimes we might want to group by more than one variable at once.

Imagine we wanted to see if men and women responded differently to the two treatments?

With group_by() this is really easy: just add the name of another column to group on (gender), separated with a comma:

funimagery %>%
  group_by(gender, intervention) %>%
  summarise(mean(weight_lost_end_trt))

It looks like women lost a bit more weight on average

  • Highlight f rows

but both men and women lost more weight after FIT relative to MI.

  • Highlight MI and FIT rows for f, then m

Multiple statistics

With summarise() we can calculate multiple statistics at once for each of the groups:

  • Run code
funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt), sd(weight_lost_end_trt))

Here, we’ve calculated both the mean and standard deviation for each intervention.

Give your new variables a name

R gives the new columns a name based on the function we use to summarise() them.

For example, if we use mean() on the weight_lost_end_trt variable then the new column is called mean(weight_lost_end_trt).

  • Rerun code above and highlight where the name of the variable comes from

Unfortunately, these new variable names contain parentheses and (sometimes) spaces which makes them harder to work with. It’s much better to give the new column a specific name like this:

funimagery %>%
  group_by(intervention) %>%
  summarise(mean_weight_lost_end_trt = mean(weight_lost_end_trt), sd_weight_lost_end_trt = sd(weight_lost_end_trt))

Remember, the column names you use shouldn’t include spaces or other ‘special’ characters.

Exercise 6

Replace the ? in the code below to to calculate the median weight lost for men and women undergoing FIT and MI in the funimagery dataset.

? %>%
  group_by(?, ?) %>%
  summarise( ?(weight_lost_end_trt) )

These are the correct numbers to check your work against:

Gender Intervention Median weight loss
Female MI -1.3
Female FIT -4.6
Male MI -2.1
Male FIT -4.3

Exercise 7

Use group_by() and summarise() with the built-in iris dataset to calculate the mean Sepal.Length for each Species of flower.

These are the correct numbers to check your work against:

Species Mean sepal length
setosa 5.0
versicolor 5.9
virginica 6.6

Exercise 8

The built-in dataset chickwts contains weights of chicks (in grams) fed on different diets. Use group_by() and summarise() to calculate the mean and standard deviation chick weights for each type of feed.

The mean weight of chicks fed on linseed was g.

The standard deviation of chicks fed on sunflower was g.

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 4.

  1. Which functions are needed to make a boxplot?
  2. What is the difference between a dbl and a fct or ord?
  3. Which data types are used for continuous variables?
  4. Give an example of why the difference between dbl and fct matters when making a plot (include code examples for this if you can)
  5. How can you convert a variable from a dbl to a fct?
  6. How could you calculate the mean for one level of a factor?
  7. How would you calculate the mean for all levels of a factor (e.g. for continent in the development dataset)?
  8. How would you calculate means and standard deviations for all combinations of levels in two factors?

Extension exercises

Please remember that these extension exercises are not required to pass the course. We include them because some students work through these materials much more quickly than others — often because they have more previous experience with programming — and we aim to give all students the opportunity to stretch their skills.

If you do find you have extra time these exercises are intended to provide additional practice in the techniques covered and to be useful preparation for using R independently in a stage 4 or MSc research project.

Extension exercise 1

Make a scatterplot of the diamonds data. Show carat on the x-axis, price on the y-axis, and the clarity of the diamonds in colour. Try to produce your plot before comparing it against the answer using the button below.

Interpret your plot: Which category of diamond clarity has the steepest rise in price as size (carat) increases?

Extension exercise 2

Make a scatterplot of the fuel dataset. Show mpg on the y-axis, engine size on the x-axis, and use type of transmission (auto/manual) to colour the points. Try to produce your plot before comparing it against the answer using the button below.

Interpret your plot: Which type of car (auto or manal) sees the strongest relationship between engine size and fuel economy?

Extension exercise 3

Using the development dataset, make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)

The plot should look like this:

Extension exercise 4

Try to recreate the plot below using the mtcars built-in dataset. Remember to use factor() for columns which require conversion to fct.

Extension exercise 5

Try to recreate the table below using the painmusic dataset from psydata.

References

Solbrig, L., Whalley, B., Kavanagh, D. J., May, J., Parkin, T., Jones, R., & Andrade, J. (2019). Functional imagery training versus motivational interviewing for weight loss: A randomised controlled trial of brief individual interventions for overweight and obesity. International Journal of Obesity, 43(4), 883–894.