Inspiration
In the session we watched Hans Rosling’s “200 countries and 200 years in 4 minutes”, which we (hopefully) agreed is something to aspire to. Combined with his enthusiastic presentation, the visualisations in this clip support a clear narrative and help us understand a complex dataset.
The plot he builds is interesting because it uses many different visual attributes (aesthetics) to express features in the data, including:
- X and Y axes
- Colour
- Size of the points
- Time (in the animation)
These features are carefully selected to highlight important features of the data and support the narrative he provides. Although we need to have integrity in our plotting (we discuss bad examples in the session), this narrative aspect of a plot is important: we always need to consider our audience.
Before you start
Use the ‘files pane’ in RStudio to make a new folder on the RStudio
server to save your work. Call this datafluency202_x_
.
Inside this new folder, make a new RMarkdown file (use the ‘file’
menu and choose ‘new’). When you save the file make sure it has the
extension .rmd
, so call it datavis.rmd
for
example.
Use this new .rmd
file to save your work during this
session.
Recreate the Rosling plot
“Multi-dimensional plotting” sounds fancy, but it just means linking different visual features of a plot to multiple columns in a dataset.
In the example, Rosling’s plot is appealing and informative because it adds multiple dimensions, and uses a special logarithmic scale for the x-axis.
The Rosling plot shown above has dimensions
The additional dimension in the plot shown in the BBC video is:
- x axis (GDP)
- y axis (life exp.)
- color (continent)
- size (population)
Time is the extra dimension, because it shows these other values changing across years.
Defining dimensions/aesthetics
As a reminder: ggplot
uses the term
aesthetics to refer to different dimensions of a
plot.
‘Aesthetics’ refers to ‘what things look like’, and the
aes()
command in ggplot
maps variables
(columns in the dataset) to visual features of the plot.
There are 4 visual features (aesthetics) of plots we will use in this session:
x
andy
axescolour
size
(of a point, or thickness of a line)
We could also use:
- the
fill
colour of shaded areas shape
(of points)- the
size
of text, points or lines linetype
(of added lines, i.e. dotted/patterned or solid)- transparency (called
alpha
) of the features of a plot
(If interested, ask ChatGPT how to add these aesthetics to your plots).
Additionally, we will control the scale of the axes in the plot to improve the presentation of the data.
Rosling’s plot looked something like this:
To create a (slightly simplified) version of the plot above, the code would look something like this:
development %>%
filter(BLANK==BLANK) %>%
ggplot(aes(x=BLANK,
y=BLANK,
size=BLANK,
color=BLANK)) +
geom_point() +
scale_x_log10() +
labs(x=BLANK, y=BLANK, color=BLANK, size=BLANK)
I have removed some parts of the code. Your job is to edit the parts
which say <BLANK>
and replace them with the names of
variables from the development
dataset (available in the
psydata
package).
Some hints:
- All the
BLANK
s represent variable names in the dataset. You can see a list of the column names available by typingglimpse(development)
- If you are confused by the
filter(BLANK==BLANK)
check the title of the plot above. Remember thatfilter
selects particular rows from the data, so we can use it to restrict what is shown in the plot. What data do we need to select for this plot? - The part which reads
scale_log_10()
is explained in more detail below - The part which reads
labs(...)
sets the axis labels (ask ChatGPT about it)
Video hints:
Using multiple layers
When visualizing data, there’s always more than one way to do things.
As well as plotting different dimensions, different types of
plot can highlight different features of the data. In ggplot, different
types of plots are called geometries. Multiple layers
can be combined in the same plot by adding together commands which have
the prefix “geom_
”.
As we have already seen, geom_point()
is used to create
a scatter plot:
To add additional layers to this plot, we can add extra
geom_<NAME>
functions. For example,
geom_smooth
is used to overlay a smooth line to any x/y
plot:
Explanation of the command: We added
+ geom_smooth()
to our previous plot. This means we now
have two geometries added to the same plot: geom_point
and
geom_smooth
.
Explanation of the output: The plot shown is the
same as the previous scatterplot, but now has a smooth blue line
overlaid. This represents the local-average of mpg
, for
each level of weight
. There is also a grey-shaded area,
which represents the standard error of the local average (again there
will be more on this later in the course).
We could add other layers to the plot with other geom_
functions.
For example, we could calculate the average mpg
and
weight of all cars in the dataset and overlay that as a horizontal or
vertical line:
med_mpg <- fuel %>% summarise(median(mpg)) %>% pull(1)
fuel %>%
ggplot(aes(weight, mpg)) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = med_mpg, color="red")
Explanation of the code First we calculate
the median of mpg
using summarise. Then we use the
pull(1)
command for the first time. This selects the first
column of data (containing our median) and returns only that. That is,
it returns only a number, or a list of numbers, rather than the whole
dataframe. Then We added geom_hline
and
geom_vline
functions to our existing plot. We set the
yintercept
option to the stored med_mpg
value,
and this defines the height of the line. We added
color="red"
to make these added lines distinctive.
Explanation of the output The plot is the
same as before, but now has a red line marked at the mean of the
mpg
column. The red line is on top of the other plot
elements because we added this line at the end of our plotting code.
Make your own layered plot
Use the
mentalh
data-set frompsydata
. Create a scatter plot of screen time and anxiety scores, adapting the code above.Add a smoothed line to the plot using
geom_smooth()
Colour the plot, using the
education
variable.Add a horizontal line to the plot, with the
yintercept
set to the average anxiety score.
Scales
Incomes are “not normal”
If we re-plot the development data using the default settings in ggplot you might notice that the result looks quite different to the one shown in the video or the exercise above.
In particular, the placement of the points looks quite different:
Specifically, we can see that in the left hand panel the points are mostly compressed to the left hand of the frame. In contrast, in the original plot, the points are fairly evenly spread across the x-axis.
These plots are showing the same data. The only difference is that the original plot uses a log scale.
We can recognise the log scale by looking at the markings on the x axis:
- In the left hand panel the markings go up by 30,000 each time.
- In the original plot, each marker represents a value 10 times
larger than the previous. So, 1000, 10,0000 and 100,000 (the values
shown are in scientific
notation, so
1e+03
means 1000).
Skewed distributions
Another way to see why this helps is to plot the distribution of incomes:
Explanation of the code We used the
geom_histogram
function to make histogram of the GDP per
capita variable.
Explanation of the output The histogram shows that most GDP values are below $20,000, but a small number are much, much larger (i.e. > $100,000). This is quite typical of incomes data.
We can then add scale_x_log10()
to the same plot:
Explanation of the code We make another,
histogram this time adding scale_x_log10()
.
Explanation of the output The plot changes, and the distribution is less skewed. We can see that the scale markers are again in scientific notation, and the gaps between points of the scale are not equal: each point on the scale is 10 times larger than the previous one (1000, 10,000, 100,0000), stretching out the values across the x axis and reducing the skew.
Reaction times
In the example above we saw that incomes were not normally distributed and benefited from a log scale. Another common example of ‘non-normal’ data are those from reaction time studies.
For example:
rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')
rtdata %>%
ggplot(aes(rt)) +
geom_histogram()
Explanation of the code The line with
read_csv
takes a web address (URL) and reads a ‘comma
separated values’ data file from it. The next part makes a histogram
using the reaction time (RT) data it contains. In case it is truncated
in the output above, the full url is: https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv
Explanation of the output The RT data are strongly skewed. Consequently the median, 2.01, is lower than the mean value, 2.51.
- Copy and paste the line of code which reads the CSV data and run it.
- Recreate the histogram above (use
geom_histogram
) or a density plot (usegeom_density
) - Add the correct scale function to recreate this plot:
rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')
rtdata %>%
ggplot(aes(rt)) +
geom_histogram() +
scale_x_log10()
Or
`
Animation!
This is an entirely optional exercise. It is not required for the course assessment and is included only becasue students asked how to make animations similar to the one shown in the Rosling video. Skip to the next section if you are short on time.
The gganimate
package allows us to create animations
using ggplot. The package has good documentation here: https://gganimate.com.
As an example we can load the package:
And then adapt our previous ggplot by adding
transition_time(year)
. This adds year
as a
time-based dimension, animating the plot.
We need to save the resulting plot in a variable, and then send that
to the animate
function. Here we use a variable called
progress_plot
.
progress_plot <- development %>%
ggplot(aes(gdp_per_capita, life_expectancy, color=continent)) +
geom_point() +
scale_x_log10() +
transition_time(year)
Try to animate the following plot using data from all the years in
the development
dataset. To do this, amend the code below,
referring to the example above:
development %>%
filter(year == 1952) %>%
ggplot(aes(continent, life_expectancy)) +
geom_boxplot() +
labs(title="Year: 1952")
You need to:
- Remove the filter
- Use the
transition_time()
function
Graphics are for answering questions
Note: you do not need to have completed the extension exercise using animation to engage with this activity.
In this section you don’t need to make any plots yourself: The task is to consider a range of different ways of plotting similar data and to think about which plots are best suited to a range of different tasks.
Consider the following plots:
- An animated boxplot
- Standard boxplot
- Line graph
- Ribbon plot
Animated boxplot
Boxplot
Line graph
This plot shows the median life expectancy in 1952 and 2002.
Ribbon plot
This plot shows the median (line) and IQR (shaded area) in all available years:
- Make a table of the strengths and weaknesses of each plot, like so:
Plot | Strengths | Weaknesses |
---|---|---|
Animated boxplot | … | … |
Boxplot | … | … |
Line graph | … | … |
… | … | … |
(on paper/in your notes is fine)
- Think of some questions that can be answered from these data? For example:
- Which continent changed most between 1952 and 2002?
- Which continent has the most variability in life expectancy?
Your questions might consider factors such as change, rates of change, comparisons between high/low performing countries, heterogeneity/homogeneity among countries, and so on. Make a list of at least 5 or 6 questions which someone might want to know the answer to.
Which plots are most effective in answering each of your questions? Can you identify trade-offs between different plot types?
Imagine you were a journalist writing a story with the title “Asia sees fastest rise in life expectancy since WW2”. Why might you prefer the line graph over the boxplot? What are the pros and cons of the line plot compared with the ribbon plot?
Facets
As we add layers or visual aesthetics (e.g. colours, shapes, lines) our plots become more complex. We may run into trade-offs between information density and clarity.
To give one example, this plot shows life expectancies for each country in the development data, plotted by year:
Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).
Explanation of the code:We have created an x/y plot
as before, but this time we only added geom_smooth
(and not
geom_point
), so we can’t see the individual datapoints. We
have also added the text group=country
which means we see
one line per-country in the dataset. Finally, we also added
se=FALSE
. This hides the shaded area which
geom_smooth
adds by default.
Comment on the result: It’s pretty hard to distinguish different countries or make sense of anything!
To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours each line by continent:
development %>%
ggplot(aes(year, life_expectancy, colour=continent, group=country)) +
geom_smooth(se=FALSE)
However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents very easily.
To clean things up we can use a technique called faceting.
development %>%
ggplot(aes(year, life_expectancy, group=country)) +
geom_smooth(se=FALSE) +
facet_grid(~continent)
Explanation: We added the code
+ facet_grid(.~continent)
to our previous plot, and removed
the color aesthetic. This made ggplot
create individual
panels for each continent. Splitting the graph this way makes
it somewhat easier to compare the differences between
continents.
Try facetting yourself
Use the iris
dataset, which is built into R.
Try to recreate this plot by adapting the code shown for the
development
dataset aboveCreate a second plot which uses colours to distinguish species and does not use Facets
Which plot do you prefer — faceted or coloured? What might influence when facets are more useful than just using colour?
iris %>%
ggplot(aes(Sepal.Length, Petal.Length)) +
geom_point() +
geom_smooth() +
facet_grid(~Species)
There’s no single right answer here, but for this example I prefer the coloured plot to the faceted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot.
However, if there were many different species it might be helpful to use facets instead.
Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?
Adjusting facets
In the explanation above, we saw this plot:
development %>%
ggplot(aes(year, life_expectancy, group=country)) +
geom_smooth(se=FALSE) +
facet_grid(~continent)
Re-run the plot and make the following adaptations:
Try replacing
facet_grid(~continent)
withfacet_grid(continent~.)
. What happens?Try replacing
facet_grid
withfacet_wrap(~continent)
. What happens?If you have time (as an extension exercise), you can see more faceting examples in the
ggplot
cookbook documentation. Try to re-create some of these examples.
Comparing categories
In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.
Another common requirement is to use plots to compare summary statistics for different groups or categories.
For example, the classic plot in a psychology journal looks something like this:
However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.
Often, a better choice is to use a boxplot:
expdata %>%
ggplot(aes(x=Condition, y=RT)) +
geom_boxplot() +
xlab("") +
ylab("Reaction time (ms)") +
theme_minimal()
Explanation: We used Condition
(a
category) as our x-axis, and reaction times as the y axis. We used
geom_boxplot
to show a boxplot. We haven’t yet split the
plot by stimulus (we’ll do this below). The theme_minimal()
function simplifies the appearance of the plot and removes the
background color.
If you’re not familiar with boxplots, there are more details in the
help files (type ?geom_boxplot
into the console) or use the
wikipedia page
here
Loading data over the internet
So far we have always used data from the psydata
package. However, we can also load data from files or over the internet
using the read_csv()
function. This is an example:
Rows: 240
Columns: 4
$ Condition <chr> "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "…
$ stimuli <chr> "Stimulus 1", "Stimulus 1", "Stimulus 1", "Stimulus 2", "Stimulus 2", "Stimulus 2", "Stimulus 3", "Stimulus 3", "Stimu…
$ p <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4…
$ RT <dbl> -10.72829, 313.68298, 556.61563, -80.62659, 560.36673, 586.03610, 264.46975, 271.76747, 267.13422, 186.50668, 290.0398…
Explanation: The part which starts
expdata <- read_csv('https://...
loads a CSV (data) file
over the internet, and stores it in the variable called
expdata
. You can click this
link to see the raw data. We then use glimpse
to check
the data
Load the (simulated) dataset from this url: https://t.ly/yzUO1 (note that this URL is a shortcut and redirects to the full url, which is actally https://gist.githubusercontent.com/benwhalley/7bcc[...]61cc/expdata2.csv. We shortened the link to make it easier to copy and paste).
- Recreate the boxplot above
- Use a facet to recreate the plot you saw above, enabling comparisons
across
Stimuli
(on the x axis) andCondition
.
Bar plots and other plots of summary data
If you wanted to plot the mean and standard error or standard
deviations of different categories, ggplot also has the
stat_summary
command:
Explanation: We used Condition
and
RT
as our x and y axes, as before. This time we added
stat_summary()
instead of geom_boxplot()
. By
default this plots the mean and standard error (a measure of
variability) in each group, using a point-range plot.
This is better than a bar chart because it avoids a known bias in how we
read them.
You can ignore the warning about
No summary function supplied, defaulting to mean_se()
for
now (or ask ChatGPT to explain it and how to get rid of it if it bothers
you).
- Adapt your faceted boxplot from above to show the mean and standard error instead
Making bar plots
Although they are not recommended, stat_summary
is a
convenient way to produce bar plots.
And you can combine multiple calls to stat_summary
to
get a bar chat with error bars on top:
Decisions about grouping
Consider the following two plots. Both present the same data on the incomes of men and women who are either blind or normally sighted:
- Which plot do you prefer, and why?
I don’t have a strong preference between the plots, but I think plot A is probably more useful. My reasoning is that in this plot we can see that:
- Men are paid more
- Blind people are paid less
- The gender-pay gap appears to be larger for blind people
I think that third point is interesting but slightly harder to read/understand from plot B, and so I would choose plot A.
The idea to take-away is that grouping the data in the plot in different ways can make it easier or more difficult to identify different patterns.
- Re-organize the plot below to make it easier to see which level of education has the highest gender pay gap.
earnings %>%
ggplot(aes(x=education, y=income/1000)) +
facet_grid(~gender) +
stat_summary() +
labs(y = "Income (£1000)", x="Gender")
To reorganise the panels, you will need to amend the variables used to facet the plot.
- Try adding
scale_y_log10()
to your updated plot. How does this change the impression you have of the data? Which plot is the most “truthful”?
By adding a log scale to the gender pay gaps appear to be more equal across different education categories.
Either plot can be considered more or less ‘truthful’, depending on your perspective:
If you believe a £2000 pay gap is the same for people earning around £20,000 as it is for those earning £100,000 then the first plot is more helpful. We can see there that largest pay gaps are among the most educated and so highest-earning.
If we consider that £2000 means less to those earning very high incomes, then the second plot is more helpful. This plot shows that, relative to total income, the size of the pay gap is pretty consistent across education categories.
However the elephant in the room is that income has a very skewed distribution:
This means that medians for men/women are very different to the means, and the median gender pay gap is different to the mean gender pay gap:
earnings %>%
group_by(blind, gender) %>%
summarise(m=mean(income)) %>%
group_by(blind) %>%
summarise(diff(m)) %>%
set_names(c("Vision", "Mean pay gap (£)")) %>%
pander::pander()
Vision | Mean pay gap (£) |
---|---|
Blind | 20370 |
Normally sighted | 8473 |
earnings %>%
group_by(blind, gender) %>%
summarise(m=median(income)) %>%
group_by(blind) %>%
summarise(diff(m)) %>%
set_names(c("Vision", "Median pay gap (£)")) %>%
pander::pander()
Vision | Median pay gap (£) |
---|---|
Blind | 1735 |
Normally sighted | 8222 |
The moral here is that different summaries of the data tell different stories: We should explore our data thoroughly, and avoid jumping to conclusions!
Check your knowledge
- What does
.Rmd
stand for (what type of file is it)? - Give three examples of visual ‘aesthetics’ in which can be adjusted
in
ggplot
- Which is correct:
labs(x=VARIABLE_NAME)
orlabs(x="Variable name")
. Why is this? - What symbol do we use to add a
geom_point()
layer to a plot? Is it%>%
or+
? - What value do we need to set for a
geom_hline()
layer when adding it to a plot? - Explain an advantage of using a log-scale plot for income data? Give an example of another type of data, common in psychology, where a log scale might also be useful.
- What sort of plots would allow us to check if a distribution is skewed?
- What effect does
scale_y_log10
have on large vs small values? - What is a facet?
- When might we use facets in our work?
- What is the difference between
facet_wrap
andfacet_grid
? - When choosing variables to use as facets or the x-axis, what principle should guide us?