Overview
As researchers, we often collect data from two or more groups.
Group allocations are categorical variables. They are stored in a special way by R, which makes it easier to display them on graphs and use them in analyses.
In this workshop we learn how to use categorical variables in boxplots, to add colour to scatterplots, and to make tables of descriptive statistics, which you will need in your coursework and project reports.
Making boxplots
- Boxplots are useful for comparing categorised data
- Use aes() with ggplot() to choose a categorical column as the x-axis
- geom_boxplot() draws boxes and whiskers for each category
- The box describes the interquartile range
- The midpoint is the median
- Individual points show outliers in the data
Boxplots are useful for visualising differences between categories or groups.
Using ggplot(), making a boxplot is similar to a scatterplot.
If we haven’t already done it, we’d need to load the tidyverse:
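As a minimal sketch, assuming the tidyverse package is already installed on your computer, loading it looks like this:

```r
# load the tidyverse, which provides ggplot() and the %>% pipe
library(tidyverse)
```

This only needs to be run once per R session.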
To make the plot, we first choose the data we want to plot and add a pipe to the end of the line to send this data to the next command:
- From the code below, select only the first line
- Also show selecting/executing just the name of the dataset and how this displays the data
- Emphasise the importance of breaking code down and running sub-steps separately
Next we use ggplot() and aes(x = ..., y = ...) to select the columns we want to use in the data. aes is short for aesthetics, which means ‘something you can see’. So the aes() function defines what we will be able to see on the plot: the axes, and other features like colour.
This time we have chosen Species for the x-axis, because it is a categorical variable.
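A sketch of this step, assuming the built-in iris dataset. Using Sepal.Length for the y-axis is an assumption made for illustration; any numeric column would work the same way. Run on its own, this code sets up the axes without drawing any data:

```r
library(tidyverse)

# send the iris data to ggplot() and choose a column for each axis
# (Sepal.Length is an assumed choice of y-axis, for illustration only)
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length))
```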
- Demonstrate running code above
- Also show using autocomplete in RStudio when writing the code above
You’ll notice that I used the RStudio autocomplete feature to write function and column names from our dataset. This makes things much easier — especially for column names. If you type column names by hand it’s easy to make spelling errors or typos, and this leads to errors in R.
If we run the code so far, ggplot() draws the plot axes, but doesn’t add the data yet:
Run the code so far
So far we have said what we want to show on our plot, but not how it should be shown.
To finish, we add geom_boxplot() which actually draws the boxes:
Note that I used a + symbol to add the boxplot layer to our graph.
So, now we have a boxplot, with one box drawn for each value of Species.
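Putting the pieces together, the complete boxplot code might look like the sketch below. It again assumes the built-in iris data, with Sepal.Length as an illustrative choice for the y-axis:

```r
library(tidyverse)

iris %>%                                        # choose the data
  ggplot(aes(x = Species, y = Sepal.Length)) +  # choose what to show
  geom_boxplot()                                # choose how to show it
```

Each line does one job, which makes it easy to select and run the code one step at a time when something goes wrong.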
Interpreting boxplots
In a boxplot:
The thick line across the middle of the box is the median or midpoint of the data
The height of the box indicates the interquartile range (IQR), so the box contains the middle 50% of the datapoints. A taller box (a wider IQR) indicates greater variation (spread) in a dataset.
The meaning of the “whiskers” (the lines above the top, and below the bottom of the box) varies a little depending on the software you use, but they always show the range/spread of the data. By default, ggplot() draws each whisker out to the most extreme data point that is no more than 1.5 times the IQR beyond the edge of the box.
Any data point outside the range of the whiskers is described as an ‘outlying point’ or ‘outlier’. Each outlier is plotted individually as a dot.
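To make these definitions concrete, the numbers behind one box can be worked out by hand. This is a sketch using base R functions on one species from the built-in iris data; the variable names (x, q, iqr and so on) are made up for this example:

```r
# sepal lengths for one species from the built-in iris data
x <- iris$Sepal.Length[iris$Species == "virginica"]

median(x)                         # the thick line across the middle of the box
q <- quantile(x, c(0.25, 0.75))   # the bottom and top edges of the box
q
iqr <- IQR(x)                     # the height of the box
iqr

# whiskers can reach at most 1.5 * IQR beyond the edges of the box
lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[2] + 1.5 * iqr

# any values outside these limits are drawn as individual dots
outliers <- x[x < lower_limit | x > upper_limit]
outliers
```

Comparing these numbers with a boxplot of the same data is a useful check that you are reading the plot correctly.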
Exercise 1
- Open session-3.rmd using the Files pane. This is the workbook you will be using in this session.
- Use the iris dataset to create a boxplot with Species on the x-axis and Sepal.Width on the y-axis (sepals are the leaves that encase an iris flower).
Your boxplot should look like this:
Adding colour to a plot
- The points of a scatterplot can be coloured based on a third column of data
- We can colour points using categorical or continuous data
- The aes(..., colour=column_name) option selects the column to use (change column_name to the name of your column)
As we saw before, scatterplots show the relationship between two variables.
Using the mpg data, we could plot the size of car engines (displ, short for displacement) against their fuel efficiency on the highway in miles per gallon (hwy):
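A sketch of the code for this basic scatterplot, using the mpg dataset that is loaded with the tidyverse:

```r
library(tidyverse)

# engine size (displ) on the x-axis,
# highway miles per gallon (hwy) on the y-axis
mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point()
```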
- Run code to generate plot
NOTE: This first plot is not shown here to save space, but is shown in the video.
We can add colour to distinguish the points in this plot — for example, what kind of car each point represents.
So far, you’ve used the aes() function to define which variable is plotted on the x and y axes. The aes() function is short for ‘aesthetics’ (what we can see) and connects columns in our data to visual aspects of a plot, like the x and y axes.
The colour=... option adds to this, and says which column is used to colour the points.
In the mpg data, the drv column contains one of three values. The value categorises cars based on which wheels the engine drives. An f in drv means a front wheel drive car, r means rear wheel drive, and 4 means four wheel drive.
We can write colour=drv to tell R to colour each point, depending on which wheels are driven:
- Add
colour=drv
- Run code to generate plot
# colour each point; different colour for each type of drive train
# front, rear and four wheel drive
mpg %>%
  ggplot(aes(displ, hwy, colour=drv)) +
  geom_point()
Notice that ggplot() automatically adds a key to the plot, mapping the colour of the points to the categories in drv.
This is an example of using a categorical variable to colour our points.
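Colour can also follow a continuous column. As a sketch, assuming we pick cty (city miles per gallon, another numeric column in mpg) purely for illustration, ggplot() uses a colour gradient instead of a separate colour for each category:

```r
library(tidyverse)

# colour each point by a continuous column;
# the key becomes a gradient rather than a list of categories
mpg %>%
  ggplot(aes(displ, hwy, colour = cty)) +
  geom_point()
```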
Exercise 2
- Use the development data from the psydata package (run library(psydata) first if you haven’t already).
- Create a scatterplot with life_expectancy on the x-axis (along the bottom) and gdp_per_capita on the y-axis.
- Add colour to this scatterplot: make each continent a different colour.
- Run the chunk of code.
Your plot should look like this:
Using different types of data and visual scales
- Continuous, categorical and text data are all common in psychology
- Internally, R stores columns of data using different data types.
- R uses data types as a clue to set defaults (e.g. for graphs)
- Sometimes we need to convert between types
- For example, sometimes categorical data is (wrongly) stored as a numeric column
# always load the libraries we need first
library(tidyverse)
library(psydata)
# see a list of columns in the dataset, and their types
iris %>% glimpse
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, …
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, …
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, …
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
# make a scatter plot with two continuous axes
# both weight and mpg are numeric variables
# so this works well
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point()
# try and make a boxplot with `cyl` as the x-axis
# but because `cyl` is stored as a numeric variable
# (not categorical) the scale of the x-axis is wrong
fuel %>%
  ggplot(aes(cyl, mpg)) +
  geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Ok this is a slightly longer video. In the first bit we introduce some background to the term variable, and the way in which R stores different types of data. In the final part we see why this is important when we’re making plots (and running other analyses).
I’m especially going to talk about:
- the different types of variable we may have in our study (e.g. continuous, categorical, text)
- the way R stores these different data (e.g. as numeric, factor or character data) and
- the way that ggplot() presents these different data types
These might seem like small details — but having some understanding of what is happening will help later on: you will be able to control how your data are presented in tables and plots. It will also help you make sure your data are in the right format for other statistical tests.
Columns vs variables
This is really important. Make sure you are clear on the different meanings of the word ‘variable’
The word variable gets used in at least 4 different ways in quantitative research, and this confuses a lot of people.
We can’t avoid all ambiguity because these different usages are common in the field, but it helps to know about them in advance. The main usages are:
- variables in a theoretical model. That is, things or constructs we think exist, and which cause other things to happen. An example might be “empathy”, or “working memory”. These are psychological constructs which many theories include, and those theories say these constructs play a causal role in how we behave or perform on certain tasks.
- variables in a study design. An example here might be an experimental group allocation, or an attribute of participants like their age or gender.
- variables in a dataset. Sometimes people want to refer to a column of numbers in a dataset (e.g. a spreadsheet), where the column has a name, and use the term variable again. This usage overlaps with the previous two, but it doesn’t have to. For example, we might have a column of numbers recording the date and time participants completed an empathy questionnaire. Analysed a certain way these timestamps might tell us something useful, but the column of timestamps isn’t a variable in our theoretical model, and they might not be used as a variable in our experiment.
- variables in R and computer programs. This type of variable is a general purpose container which can store anything. Variables in R can contain columns of numbers, but they can also contain whole datasets, or graphs, or the results of statistical tests. A variable in R often isn’t the same thing as an experimental variable or a column in a dataset.
In this short course we use variable to mean either an experimental/theoretical variable or an R container.
When we mean a column of data in a dataset we will use the word column instead.
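To illustrate the last usage, here is a small sketch of R variables acting as containers for different kinds of thing. All of the variable names are made up for this example, and the correlation test is chosen only to show that a test result can be stored too:

```r
library(tidyverse)

# a variable holding a single column of numbers
sepal_lengths <- iris$Sepal.Length

# a variable holding a whole dataset
flowers <- iris

# a variable holding a graph
my_plot <- flowers %>%
  ggplot(aes(Species, Sepal.Length)) +
  geom_boxplot()

# a variable holding the result of a statistical test
my_test <- cor.test(iris$Sepal.Length, iris$Petal.Length)
```

After running this, all four variables appear in the Environment pane in RStudio, even though only the first one is a column of data.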
Types of variable
OK so let’s focus for a minute on variables in a study design.
There are a few main types of these variables to consider. You might be familiar with some of these terms already:
- categorical or nominal variables (sometimes also called factors): e.g. gender, education status
- binary variables (a special category that is either TRUE or FALSE): e.g. smoker/non-smoker
- ordinal variables (categorical responses where the order is important): e.g. a 1-7 response on a Likert-style question
- interval or continuous variables: e.g. weight in kg, time in milliseconds
- count variables (a frequency, measured in whole numbers): e.g. the number of social contacts per week
- text variables: e.g. free text responses to open-ended questionnaire items
Types of column
OK, so those are the types of variable.
We can also think about the types of column that our dataset stores.
Datasets have multiple columns, each with a unique name.
In R, every column has a data type. These data types are normally determined by the type of variable. But there is some overlap, and different data types could be used to store the same variable.
We sometimes need to convert between data types.
There are four data types you need to know about in R: factors, text, numeric and logical columns.
- Factors, which are used for categorical data (and sometimes ordinal data too)
- Text, also called character or string data.
- Numeric columns can be either integer or double. Double is computer speak for ‘double-precision number’, which just means decimals and really big numbers are allowed
- Logical, which can be used to store values that can only be either TRUE or FALSE.
The important thing to know is that, although different types of variable are usually stored in different types of column in R, this isn’t always the case.
Sometimes data can be stored as the wrong type (e.g. categorical data could end up stored as numeric data), and in these cases we need to convert from one format to another.
You can see which data type is used to store a variable when using the glimpse() command introduced in Session 1.
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, …
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, …
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, …
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
In the output you can see the column names listed on the left, followed by text in angle brackets, e.g. <dbl>, which is the abbreviated name of the data type.
In this built-in dataset, most of the data is numeric (dbl), but the Species variable is categorical, and stored as a factor (fct).
It’s possible to convert columns from one type to another, to suit our needs. We’ll see more of this later.
Data/column type | Useful for | Abbreviations used/subtypes | Often need to convert from |
---|---|---|---|
Factor | categorical, ordinal | fct, ord | text |
Text | text, categorical | chr | categorical |
Numeric | continuous, ordinal | int, dbl | categorical; text |
Logical | binary (boolean) | lgl | numeric |
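As a rough sketch of what some of these conversions look like, the code below uses small made-up vectors (the values and variable names are hypothetical) with the base R functions factor() and as.logical():

```r
# the number of cylinders, stored as numbers
cyl <- c(6, 6, 4, 8)
class(cyl)

# numeric to factor: each distinct value becomes a category (level)
cyl_factor <- factor(cyl)
class(cyl_factor)
levels(cyl_factor)

# text to factor
gender <- c("f", "m", "f")
gender_factor <- factor(gender)
levels(gender_factor)

# numeric 0/1 codes to logical TRUE/FALSE
smoker <- as.logical(c(1, 0, 1))
smoker
```

The same functions can be applied to a column inside a dataset, which is how they are most often used in practice.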
Column types determine the scales on graphs
If we look at the fuel data (from psydata) we can see that most of the columns are stored as numeric data (dbl):
Rows: 32
Columns: 7
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, …
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 452…
$ power <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230,…
$ weight <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 171…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, …
$ automatic <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
This is fine if we want to make a scatter plot.
In the example below we have ‘miles per gallon’ vs the weight of cars in the fuel dataset (in psydata):
In this example of a scatterplot both the x and y axes are continuous. That is, we expect them to be real numbers, and they are stored as numeric columns.
However, if we want to make a boxplot of the mpg column using cyl (number of cylinders) as the x axis then we have a problem:
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
The warning message says Continuous x aesthetic -- did you forget aes(group=...)?
This is a clue as to what has gone wrong.
We might have expected to see:
- miles per gallon on the \(y\) axis
- three separate boxes, one each for 4, 6 and 8 cylinders
It didn’t work as expected though.
R spots that cyl is stored as numeric data, and because it doesn’t know better it creates a continuous scale on the \(x\) axis.
Only one box is drawn at the midpoint of all the values of cyl; because cyl ranges from 4 to 8 the box appears at 6.
Although R often has good defaults, it isn’t smart — if it doesn’t behave as we expect we need to be more specific.
Here we want to use cyl as a categorical variable, so we need to give R precise instructions. We should convert cyl to a factor explicitly. Then our plot will work properly.
We can use the command factor(cyl) to tell R that the \(x\)-axis is a factor:
This gives us the boxplot we were expecting.
The only change was to replace cyl with factor(cyl). This tells R to convert the variable cyl to a factor, and ggplot() can then set the scale of the x-axis correctly.
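A sketch of the corrected plot. To keep it runnable without psydata, this version assumes the built-in mtcars dataset, which also has numeric cyl and mpg columns; with psydata loaded, replacing mtcars with fuel should work in the same way:

```r
library(tidyverse)

# wrapping cyl in factor() gives one box for each number of cylinders
mtcars %>%
  ggplot(aes(factor(cyl), mpg)) +
  geom_boxplot()
```

Note that the conversion happens only inside the plot; the cyl column in the dataset itself is still numeric.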
Exercise 3
Work with a friend: describe the 4 ways in which quantitative researchers might use the word ‘variable’.
(If you need to look these up from the video or text above then try testing yourself again after completing other exercises).
Exercise 4
Use glimpse() to check the data types of the mpg and the diamonds datasets.
- The data type of the 4th variable in the mpg data is ____
- The data type of the 5th variable in the diamonds data is ____
Exercise 5
Use the fuel dataset to make a boxplot showing miles per gallon on the y-axis, and number of gears (gear) on the x-axis. Your plot should look like this:
Remember that you will need to change the type of the gear variable for this to work.
Using group_by() to make a summary table and compare groups
- Datasets often contain categorical variables
- We often want to compare statistics (like averages) between categories
- The group_by() function is a quick way to combine filtering and summarising
- group_by() creates a grouped data.frame
- Adding group_by() to a pipeline runs the subsequent steps once for each group.
- The result is always a new data.frame
# check the columns in the funimagery dataset
# results of an RCT of functional imagery training for weight loss
# conducted at the University of Plymouth
funimagery %>% glimpse
Rows: 112
Columns: 8
$ gender <chr> "f", "f", "f", "f", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "m"…
$ age <int> 44, 32, 33, 21, 27, 56, 50, 57, 34, 25, 70, 56, 55, 43, 37, 45, 60, 51, 2…
$ kg1 <dbl> 107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0, 87.0, 121.2, 84.0, 100.…
$ kg2 <dbl> 106.7, 105.4, 101.0, 79.0, 80.0, 57.0, 92.0, 87.0, 87.0, 123.0, 81.9, 97.…
$ kg3 <dbl> 106.0, 105.9, 98.8, 78.0, 80.0, 60.0, 92.0, 87.0, 86.4, 119.7, 84.0, 98.0…
$ person <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
$ intervention <fct> MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, M…
$ weight_lost_end_trt <dbl> -1.1, -1.6, 1.5, -1.0, -1.0, -2.0, -3.0, -3.0, 0.0, 1.8, -2.1, -2.8, -2.4…
# boxplot to compare the intervention groups
funimagery %>%
  ggplot(aes(intervention, weight_lost_end_trt)) +
  geom_boxplot()

# calculate mean weight loss in each group in a laborious way
funimagery %>%
  filter(intervention=="MI") %>%
  summarise(mean(weight_lost_end_trt))

funimagery %>%
  filter(intervention=="FIT") %>%
  summarise(mean(weight_lost_end_trt))

# use group_by to split the data and summarise each group
funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt))

# store the result (also a data.frame) in a new variable
average_weight_loss <- funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt))

# example of grouping by two columns at once
funimagery %>%
  group_by(gender, intervention) %>%
  summarise(mean(weight_lost_end_trt))

# calculate the mean and SD in one go
funimagery %>%
  group_by(intervention) %>%
  summarise(
    mean(weight_lost_end_trt),
    sd(weight_lost_end_trt)
  )

# how to give your new summary columns a name
# (this is good practice)
funimagery %>%
  group_by(intervention) %>%
  summarise(
    mean_weight_lost_end_trt = mean(weight_lost_end_trt),
    sd_weight_lost_end_trt = sd(weight_lost_end_trt)
  )
In this video we’ll use the funimagery dataset, which is part of the psydata package. This contains data from a randomised controlled trial run in Plymouth (Solbrig et al., 2019).
The study compared two treatments for weight loss: Functional Imagery Training (FIT) and Motivational Interviewing (MI).
We can see the columns in the dataset using glimpse():
- Run code
Rows: 112
Columns: 8
$ gender <chr> "f", "f", "f", "f", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "m"…
$ age <int> 44, 32, 33, 21, 27, 56, 50, 57, 34, 25, 70, 56, 55, 43, 37, 45, 60, 51, 2…
$ kg1 <dbl> 107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0, 87.0, 121.2, 84.0, 100.…
$ kg2 <dbl> 106.7, 105.4, 101.0, 79.0, 80.0, 57.0, 92.0, 87.0, 87.0, 123.0, 81.9, 97.…
$ kg3 <dbl> 106.0, 105.9, 98.8, 78.0, 80.0, 60.0, 92.0, 87.0, 86.4, 119.7, 84.0, 98.0…
$ person <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
$ intervention <fct> MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, M…
$ weight_lost_end_trt <dbl> -1.1, -1.6, 1.5, -1.0, -1.0, -2.0, -3.0, -3.0, 0.0, 1.8, -2.1, -2.8, -2.4…
Weight was measured three times, and these measurements are in the columns called kg1, kg2 and kg3.
- The first measurement (kg1) was taken at baseline, when participants joined the study
- The second measurement (kg2) was taken at the end of treatment, which was 6 months after baseline
- The final measurement (kg3) was made 12 months after baseline
The weight_lost_end_trt column is the difference between kg1 and kg2 (calculated as kg2 minus kg1), i.e. how the participant’s weight changed by the end of their treatment. Negative values mean the participant lost weight.
We have already seen that a boxplot can compare scores between categories:
- Generate plot
This plot is helpful, and makes the differences really clear. But what if we want the actual numbers in a table, or to report in the text of our paper?
Using filter() and summarise()
One option would be to filter our data first and then summarise each subset in turn:
- Run code
This output shows us the mean for the MI group. We then have to repeat that code for the FIT group:
- Run code
This is a bit repetitive, and would be worse if there were lots of categories. Imagine doing something similar for each continent in the development data, for example.
Using group_by()
Instead of using filter(), we can use group_by() to split our dataset into multiple groups, summarising each one separately:
- Run code
Using group_by() before summarise() splits up the data, in this case depending on which intervention the participant received.
- Highlight group_by()
The summarise() function calculates the mean for each group separately.
- Highlight summarise()
The result is a new data.frame with two columns: the name of the intervention, and the average weight lost. We could save this in a new variable for use later, if we liked:
average_weight_losses <- funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt))
You can see the new variable in the Environment, so it’s stored for later.
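To check that group_by() really does the same job as repeated filter() calls, here is a sketch using ToothGrowth, a dataset built into R (chosen only so the example runs without psydata). Its supp column is a factor recording which supplement each animal received (OJ or VC), and len is the measured tooth length; group_means is a made-up variable name:

```r
library(tidyverse)

# the laborious way: filter one group at a time
ToothGrowth %>%
  filter(supp == "OJ") %>%
  summarise(mean(len))

ToothGrowth %>%
  filter(supp == "VC") %>%
  summarise(mean(len))

# the same means, calculated for every group in one step
group_means <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(mean(len))

group_means
```

The two rows of group_means should match the two separate results above, and the code would not get any longer if there were more groups.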
Nested groups
Sometimes we might want to group by more than one variable at once.
Imagine we wanted to see if men and women responded differently to the two treatments.
With group_by() this is really easy: just add the name of another column to group on (gender), separated with a comma:
It looks like women lost a bit more weight on average, but both men and women lost more weight after FIT relative to MI.
- Highlight f rows
- Highlight MI and FIT rows for f, then m
Multiple statistics
With summarise() we can calculate multiple statistics at once for each of the groups:
- Run code
funimagery %>%
  group_by(intervention) %>%
  summarise(mean(weight_lost_end_trt), sd(weight_lost_end_trt))
Here, we’ve calculated both the mean and standard deviation for each intervention.
Give your new variables a name
R gives the new columns a name based on the function we use to summarise() them.
For example, if we use mean() on the weight_lost_end_trt variable then the new column is called mean(weight_lost_end_trt).
- Rerun code above and highlight where the name of the variable comes from
Unfortunately, these new variable names contain parentheses and (sometimes) spaces which makes them harder to work with. It’s much better to give the new column a specific name like this:
funimagery %>%
  group_by(intervention) %>%
  summarise(
    mean_weight_lost_end_trt = mean(weight_lost_end_trt),
    sd_weight_lost_end_trt = sd(weight_lost_end_trt)
  )
Remember, the column names you use shouldn’t include spaces or other ‘special’ characters.
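To see why the automatic names are awkward, here is a sketch using ToothGrowth, a dataset built into R (chosen only so the example runs without psydata; supp is the supplement given and len is tooth length). The variable names unnamed and named, and the cut-off of 20, are made up for this example:

```r
library(tidyverse)

# without a name, the new column is called `mean(len)`
unnamed <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(mean(len))

names(unnamed)

# to use that column later we have to wrap its name in backticks
unnamed %>%
  filter(`mean(len)` > 20)

# with a name, the column is much easier to refer to
named <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(mean_len = mean(len))

named %>%
  filter(mean_len > 20)
```

Both versions give the same numbers; only the column name, and how easy it is to type, changes.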
Exercise 6
Replace the ? in the code below to calculate the median weight lost for men and women undergoing FIT and MI in the funimagery dataset.
These are the correct numbers to check your work against:
Gender | Intervention | Median weight loss |
---|---|---|
Female | MI | -1.3 |
Female | FIT | -4.6 |
Male | MI | -2.1 |
Male | FIT | -4.3 |
Exercise 7
Use group_by() and summarise() with the built-in iris dataset to calculate the mean Sepal.Length for each Species of flower.
These are the correct numbers to check your work against:
Species | Mean sepal length |
---|---|
setosa | 5.0 |
versicolor | 5.9 |
virginica | 6.6 |
Exercise 8
The built-in dataset chickwts contains weights of chicks (in grams) fed on different diets. Use group_by() and summarise() to calculate the mean and standard deviation of chick weights for each type of feed.
The mean weight of chicks fed on linseed was ____ g.
The standard deviation of chicks fed on sunflower was ____ g.
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 4.
- Which functions are needed to make a boxplot?
- What is the difference between a dbl and a fct or ord?
- Which data types are used for continuous variables?
- Give an example of why the difference between dbl and fct matters when making a plot (include code examples for this if you can)
- How can you convert a variable from a dbl to a fct?
- How could you calculate the mean for one level of a factor?
- How would you calculate the mean for all levels of a factor (e.g. for continent in the development dataset)?
- How would you calculate means and standard deviations for all combinations of levels in two factors?
Extension exercises
Please remember that these extension exercises are not required to pass the course. We include them because some students work through these materials much more quickly than others — often because they have more previous experience with programming — and we aim to give all students the opportunity to stretch their skills.
If you do find you have extra time these exercises are intended to provide additional practice in the techniques covered and to be useful preparation for using R independently in a stage 4 or MSc research project.
Extension exercise 1
Make a scatterplot of the diamonds data. Show carat on the x-axis, price on the y-axis, and the clarity of the diamonds in colour. Try to produce your plot before comparing it against the answer using the button below.
Interpret your plot: Which category of diamond clarity has the steepest rise in price as size (carat) increases?
Extension exercise 2
Make a scatterplot of the fuel dataset. Show mpg on the y-axis, engine size on the x-axis, and use type of transmission (auto/manual) to colour the points. Try to produce your plot before comparing it against the answer using the button below.
Interpret your plot: Which type of car (auto or manual) sees the strongest relationship between engine size and fuel economy?
Extension exercise 3
Using the development dataset, make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)
The plot should look like this:
Extension exercise 4
Try to recreate the plot below using the mtcars built-in dataset. Remember to use factor() for columns which require conversion to fct.
Extension exercise 5
Try to recreate the table below using the painmusic dataset from psydata.