Data Visualisation

Ben Whalley

Oct. 2023

Inspiration

In the session we watched Hans Rosling’s “200 countries and 200 years in 4 minutes”, which we (hopefully) agreed is something to aspire to. Combined with his enthusiastic presentation, the visualisations in this clip support a clear narrative and help us understand a complex dataset.

The plot he builds is interesting because it uses many different visual attributes (aesthetics) to express features in the data, including:

  • X and Y axes
  • Colour
  • Size of the points
  • Time (in the animation)

These features are carefully selected to highlight important features of the data and support the narrative he provides. Although we need to have integrity in our plotting (we discuss bad examples in the session), this narrative aspect of a plot is important: we always need to consider our audience.

Before you start

Use the ‘files pane’ in RStudio to make a new folder on the RStudio server to save your work. Call this datafluency202_x_.

Inside this new folder, make a new RMarkdown file (use the ‘file’ menu and choose ‘new’). When you save the file make sure it has the extension .rmd, so call it datavis.rmd for example.

Use this new .rmd file to save your work during this session.

Recreate the Rosling plot

“Multi-dimensional plotting” sounds fancy, but it just means linking different visual features of a plot to multiple columns in a dataset.

In the example, Rosling’s plot is appealing and informative because it adds multiple dimensions, and uses a special logarithmic scale for the x-axis.

  • The Rosling plot shown above has dimensions

  • The additional dimension in the plot shown in the BBC video is:

  • x axis (GDP)
  • y axis (life exp.)
  • color (continent)
  • size (population)

Time is the extra dimension, because it shows these other values changing across years.

Defining dimensions/aesthetics

As a reminder: ggplot uses the term aesthetics to refer to different dimensions of a plot.

Aesthetics’ refers to ‘what things look like’, and the aes() command in ggplot maps variables (columns in the dataset) to visual features of the plot.

There are 4 visual features (aesthetics) of plots we will use in this session:

  • x and y axes
  • colour
  • size (of a point, or thickness of a line)

We could also use:

  • the fill colour of shaded areas
  • shape (of points)
  • the size of text, points or lines
  • linetype (of added lines, i.e. dotted/patterned or solid)
  • transparency (called alpha) of the features of a plot

(If interested, ask ChatGPT how to add these aesthetics to your plots).



Additionally, we will control the scale of the axes in the plot to improve the presentation of the data.

Rosling’s plot looked something like this:

To create a (slightly simplified) version of the plot above, the code would look something like this:

development %>%
  filter(BLANK==BLANK)  %>%
  ggplot(aes(x=BLANK,
             y=BLANK,
             size=BLANK,
             color=BLANK)) +
  geom_point() + 
  scale_x_log10() + 
  labs(x=BLANK, y=BLANK, color=BLANK, size=BLANK)

I have removed some parts of the code. Your job is to edit the parts which say <BLANK> and replace them with the names of variables from the development dataset (available in the psydata package).

Some hints:

  • All the BLANKs represent variable names in the dataset. You can see a list of the column names available by typing glimpse(development)
  • If you are confused by the filter(BLANK==BLANK) check the title of the plot above. Remember that filter selects particular rows from the data, so we can use it to restrict what is shown in the plot. What data do we need to select for this plot?
  • The part which reads scale_log_10() is explained in more detail below
  • The part which reads labs(...) sets the axis labels (ask ChatGPT about it)

Video hints:

Using multiple layers

When visualizing data, there’s always more than one way to do things. As well as plotting different dimensions, different types of plot can highlight different features of the data. In ggplot, different types of plots are called geometries. Multiple layers can be combined in the same plot by adding together commands which have the prefix “geom_”.

As we have already seen, geom_point() is used to create a scatter plot:

To add additional layers to this plot, we can add extra geom_<NAME> functions. For example, geom_smooth is used to overlay a smooth line to any x/y plot:

fuel %>% 
  ggplot(aes(weight, mpg)) + 
  geom_point() +
  geom_smooth()

Explanation of the command: We added + geom_smooth() to our previous plot. This means we now have two geometries added to the same plot: geom_point and geom_smooth.

Explanation of the output: The plot shown is the same as the previous scatterplot, but now has a smooth blue line overlaid. This represents the local-average of mpg, for each level of weight. There is also a grey-shaded area, which represents the standard error of the local average (again there will be more on this later in the course).


We could add other layers to the plot with other geom_ functions.

For example, we could calculate the average mpg and weight of all cars in the dataset and overlay that as a horizontal or vertical line:

med_mpg <- fuel %>% summarise(median(mpg)) %>% pull(1)

fuel %>% 
  ggplot(aes(weight, mpg)) + 
  geom_point() +
  geom_smooth() + 
  geom_hline(yintercept = med_mpg, color="red") 

Explanation of the code First we calculate the median of mpg using summarise. Then we use the pull(1) command for the first time. This selects the first column of data (containing our median) and returns only that. That is, it returns only a number, or a list of numbers, rather than the whole dataframe. Then We added geom_hline and geom_vline functions to our existing plot. We set the yintercept option to the stored med_mpg value, and this defines the height of the line. We added color="red" to make these added lines distinctive.

Explanation of the output The plot is the same as before, but now has a red line marked at the mean of the mpg column. The red line is on top of the other plot elements because we added this line at the end of our plotting code.

Make your own layered plot

  1. Use the mentalh data-set from psydata. Create a scatter plot of screen time and anxiety scores, adapting the code above.

  2. Add a smoothed line to the plot using geom_smooth()

  3. Colour the plot, using the education variable.

  4. Add a horizontal line to the plot, with the yintercept set to the average anxiety score.

med_anx <- mentalh %>% summarise(median(anxiety_score)) %>% pull(1)

mentalh %>% 
  ggplot(aes(screen_time, anxiety_score, color=education)) + 
  geom_point() + 
  geom_smooth(se=F) + 
  geom_hline(yintercept = med_anx)

Scales

Incomes are “not normal”

If we re-plot the development data using the default settings in ggplot you might notice that the result looks quite different to the one shown in the video or the exercise above.

In particular, the placement of the points looks quite different:

Specifically, we can see that in the left hand panel the points are mostly compressed to the left hand of the frame. In contrast, in the original plot, the points are fairly evenly spread across the x-axis.

These plots are showing the same data. The only difference is that the original plot uses a log scale.


We can recognise the log scale by looking at the markings on the x axis:

  • In the left hand panel the markings go up by 30,000 each time.
  • In the original plot, each marker represents a value 10 times larger than the previous. So, 1000, 10,0000 and 100,000 (the values shown are in scientific notation, so 1e+03 means 1000).

Skewed distributions

Another way to see why this helps is to plot the distribution of incomes:

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram()

Explanation of the code We used the geom_histogram function to make histogram of the GDP per capita variable.

Explanation of the output The histogram shows that most GDP values are below $20,000, but a small number are much, much larger (i.e. > $100,000). This is quite typical of incomes data.

We can then add scale_x_log10() to the same plot:

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram() + 
  scale_x_log10()

Explanation of the code We make another, histogram this time adding scale_x_log10().

Explanation of the output The plot changes, and the distribution is less skewed. We can see that the scale markers are again in scientific notation, and the gaps between points of the scale are not equal: each point on the scale is 10 times larger than the previous one (1000, 10,000, 100,0000), stretching out the values across the x axis and reducing the skew.

Reaction times

In the example above we saw that incomes were not normally distributed and benefited from a log scale. Another common example of ‘non-normal’ data are those from reaction time studies.

For example:

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_histogram()

Explanation of the code The line with read_csv takes a web address (URL) and reads a ‘comma separated values’ data file from it. The next part makes a histogram using the reaction time (RT) data it contains. In case it is truncated in the output above, the full url is: https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv

Explanation of the output The RT data are strongly skewed. Consequently the median, 2.01, is lower than the mean value, 2.51.

  1. Copy and paste the line of code which reads the CSV data and run it.
  2. Recreate the histogram above (use geom_histogram) or a density plot (use geom_density)
  3. Add the correct scale function to recreate this plot:

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_histogram() + 
  scale_x_log10()

Or

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_density() + 
  scale_x_log10()

`

Animation!

This is an entirely optional exercise. It is not required for the course assessment and is included only becasue students asked how to make animations similar to the one shown in the Rosling video. Skip to the next section if you are short on time.


The gganimate package allows us to create animations using ggplot. The package has good documentation here: https://gganimate.com.

As an example we can load the package:

library(gganimate)

And then adapt our previous ggplot by adding transition_time(year). This adds year as a time-based dimension, animating the plot.

We need to save the resulting plot in a variable, and then send that to the animate function. Here we use a variable called progress_plot.

progress_plot <- development %>%
  ggplot(aes(gdp_per_capita, life_expectancy, color=continent)) +
  geom_point() +
  scale_x_log10() +  
  transition_time(year)
animate(progress_plot,   nframes = 50)

Try to animate the following plot using data from all the years in the development dataset. To do this, amend the code below, referring to the example above:

development %>%
  filter(year == 1952) %>% 
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() +
  labs(title="Year: 1952")

You need to:

  • Remove the filter
  • Use the transition_time() function

library(tidyverse)
library(psydata)
library(gganimate)

p <- development %>%
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() + 
  labs(title = "Year: {frame_time}") + 
  transition_time(year)

p %>% animate()

Graphics are for answering questions

Note: you do not need to have completed the extension exercise using animation to engage with this activity.

In this section you don’t need to make any plots yourself: The task is to consider a range of different ways of plotting similar data and to think about which plots are best suited to a range of different tasks.

Consider the following plots:

  • An animated boxplot
  • Standard boxplot
  • Line graph
  • Ribbon plot

Animated boxplot

Boxplot

Line graph

This plot shows the median life expectancy in 1952 and 2002.

Ribbon plot

This plot shows the median (line) and IQR (shaded area) in all available years:

  1. Make a table of the strengths and weaknesses of each plot, like so:
Plot Strengths Weaknesses
Animated boxplot
Boxplot
Line graph

(on paper/in your notes is fine)

  1. Think of some questions that can be answered from these data? For example:
  • Which continent changed most between 1952 and 2002?
  • Which continent has the most variability in life expectancy?

Your questions might consider factors such as change, rates of change, comparisons between high/low performing countries, heterogeneity/homogeneity among countries, and so on. Make a list of at least 5 or 6 questions which someone might want to know the answer to.

  1. Which plots are most effective in answering each of your questions? Can you identify trade-offs between different plot types?

  2. Imagine you were a journalist writing a story with the title “Asia sees fastest rise in life expectancy since WW2”. Why might you prefer the line graph over the boxplot? What are the pros and cons of the line plot compared with the ribbon plot?

Facets

As we add layers or visual aesthetics (e.g. colours, shapes, lines) our plots become more complex. We may run into trade-offs between information density and clarity.

To give one example, this plot shows life expectancies for each country in the development data, plotted by year:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE)

Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).

Explanation of the code:We have created an x/y plot as before, but this time we only added geom_smooth (and not geom_point), so we can’t see the individual datapoints. We have also added the text group=country which means we see one line per-country in the dataset. Finally, we also added se=FALSE. This hides the shaded area which geom_smooth adds by default.

Comment on the result: It’s pretty hard to distinguish different countries or make sense of anything!


To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours each line by continent:

development %>%
  ggplot(aes(year, life_expectancy, colour=continent, group=country)) +
  geom_smooth(se=FALSE)

However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents very easily.

To clean things up we can use a technique called faceting.

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Explanation: We added the code + facet_grid(.~continent) to our previous plot, and removed the color aesthetic. This made ggplot create individual panels for each continent. Splitting the graph this way makes it somewhat easier to compare the differences between continents.

Try facetting yourself

Use the iris dataset, which is built into R.

  1. Try to recreate this plot by adapting the code shown for the development dataset above

  2. Create a second plot which uses colours to distinguish species and does not use Facets

  3. Which plot do you prefer — faceted or coloured? What might influence when facets are more useful than just using colour?

iris %>%
  ggplot(aes(Sepal.Length, Petal.Length)) +
  geom_point()  +
  geom_smooth() +
  facet_grid(~Species)
iris %>%
  ggplot(aes(Sepal.Length, Petal.Length, color=Species)) +
  geom_point()  +
  geom_smooth() 

There’s no single right answer here, but for this example I prefer the coloured plot to the faceted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot.

However, if there were many different species it might be helpful to use facets instead.

Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?

Adjusting facets

In the explanation above, we saw this plot:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Re-run the plot and make the following adaptations:

  1. Try replacing facet_grid(~continent) with facet_grid(continent~.). What happens?

  2. Try replacing facet_grid with facet_wrap(~continent). What happens?

  3. If you have time (as an extension exercise), you can see more faceting examples in the ggplot cookbook documentation. Try to re-create some of these examples.

Comparing categories

In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.

Another common requirement is to use plots to compare summary statistics for different groups or categories.

For example, the classic plot in a psychology journal looks something like this:

However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.

Often, a better choice is to use a boxplot:

expdata  %>%
  ggplot(aes(x=Condition, y=RT)) + 
  geom_boxplot() + 
  xlab("") +
  ylab("Reaction time (ms)") + 
  theme_minimal()

Explanation: We used Condition (a category) as our x-axis, and reaction times as the y axis. We used geom_boxplot to show a boxplot. We haven’t yet split the plot by stimulus (we’ll do this below). The theme_minimal() function simplifies the appearance of the plot and removes the background color.

If you’re not familiar with boxplots, there are more details in the help files (type ?geom_boxplot into the console) or use the wikipedia page here

Loading data over the internet

So far we have always used data from the psydata package. However, we can also load data from files or over the internet using the read_csv() function. This is an example:

expdata <- read_csv('https://t.ly/yzUO1')
expdata %>% glimpse
Rows: 240
Columns: 4
$ Condition <chr> "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "Med", "High", "Low", "…
$ stimuli   <chr> "Stimulus 1", "Stimulus 1", "Stimulus 1", "Stimulus 2", "Stimulus 2", "Stimulus 2", "Stimulus 3", "Stimulus 3", "Stimu…
$ p         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4…
$ RT        <dbl> -10.72829, 313.68298, 556.61563, -80.62659, 560.36673, 586.03610, 264.46975, 271.76747, 267.13422, 186.50668, 290.0398…

Explanation: The part which starts expdata <- read_csv('https://... loads a CSV (data) file over the internet, and stores it in the variable called expdata. You can click this link to see the raw data. We then use glimpse to check the data

Load the (simulated) dataset from this url: https://t.ly/yzUO1 (note that this URL is a shortcut and redirects to the full url, which is actally https://gist.githubusercontent.com/benwhalley/7bcc[...]61cc/expdata2.csv. We shortened the link to make it easier to copy and paste).

  • Recreate the boxplot above
  • Use a facet to recreate the plot you saw above, enabling comparisons across Stimuli (on the x axis) and Condition.

expdata <- read_csv('https://t.ly/yzUO1')
expdata  %>%
  ggplot(aes(x=Condition, y=RT)) + 
  geom_boxplot() + 
  xlab("") +
  ylab("Reaction time (ms)") +
  facet_wrap(~stimuli, nrow=1) + 
  theme_minimal() 

Bar plots and other plots of summary data

If you wanted to plot the mean and standard error or standard deviations of different categories, ggplot also has the stat_summary command:

expdata %>%
  ggplot(aes(Condition, RT)) + 
  stat_summary()

Explanation: We used Condition and RT as our x and y axes, as before. This time we added stat_summary() instead of geom_boxplot(). By default this plots the mean and standard error (a measure of variability) in each group, using a point-range plot. This is better than a bar chart because it avoids a known bias in how we read them.

You can ignore the warning about No summary function supplied, defaulting to mean_se() for now (or ask ChatGPT to explain it and how to get rid of it if it bothers you).

  • Adapt your faceted boxplot from above to show the mean and standard error instead

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  stat_summary() +
  facet_wrap(~Condition)

Making bar plots

Although they are not recommended, stat_summary is a convenient way to produce bar plots.

expdata %>%
  ggplot(aes(Condition, RT)) + 
  stat_summary(geom="bar")

And you can combine multiple calls to stat_summary to get a bar chat with error bars on top:

expdata %>%
  ggplot(aes(Condition, RT)) + 
  stat_summary(geom="bar") +
  # narrower width looks neater
  stat_summary(geom="errorbar", width=.2) +  
  theme_minimal()

Decisions about grouping

Consider the following two plots. Both present the same data on the incomes of men and women who are either blind or normally sighted:

  1. Which plot do you prefer, and why?

I don’t have a strong preference between the plots, but I think plot A is probably more useful. My reasoning is that in this plot we can see that:

  1. Men are paid more
  2. Blind people are paid less
  3. The gender-pay gap appears to be larger for blind people

I think that third point is interesting but slightly harder to read/understand from plot B, and so I would choose plot A.

The idea to take-away is that grouping the data in the plot in different ways can make it easier or more difficult to identify different patterns.

  1. Re-organize the plot below to make it easier to see which level of education has the highest gender pay gap.
earnings  %>%
  ggplot(aes(x=education, y=income/1000)) + 
  facet_grid(~gender) + 
  stat_summary() +
  labs(y = "Income (£1000)", x="Gender")

To reorganise the panels, you will need to amend the variables used to facet the plot.

earnings  %>%
  ggplot(aes(x=gender, y=income/1000)) + 
  facet_grid(~education) + 
  stat_summary() +
  labs(y = "Income (£ 1000)", x="Gender")

  1. Try adding scale_y_log10() to your updated plot. How does this change the impression you have of the data? Which plot is the most “truthful”?

By adding a log scale to the gender pay gaps appear to be more equal across different education categories.

earnings  %>%
  ggplot(aes(x=gender, y=income/1000)) + 
  facet_grid(~education) + 
  stat_summary() +
  scale_y_log10() +
  labs(y = "Income (£ 1000)", x="Gender")

Either plot can be considered more or less ‘truthful’, depending on your perspective:

  • If you believe a £2000 pay gap is the same for people earning around £20,000 as it is for those earning £100,000 then the first plot is more helpful. We can see there that largest pay gaps are among the most educated and so highest-earning.

  • If we consider that £2000 means less to those earning very high incomes, then the second plot is more helpful. This plot shows that, relative to total income, the size of the pay gap is pretty consistent across education categories.

However the elephant in the room is that income has a very skewed distribution:

earnings %>% 
  ggplot(aes(income)) + geom_density()

This means that medians for men/women are very different to the means, and the median gender pay gap is different to the mean gender pay gap:

earnings  %>%
  group_by(blind, gender) %>% 
  summarise(m=mean(income)) %>% 
  group_by(blind) %>% 
  summarise(diff(m)) %>% 
  set_names(c("Vision", "Mean pay gap (£)")) %>% 
  pander::pander()
Vision Mean pay gap (£)
Blind 20370
Normally sighted 8473
earnings  %>%
  group_by(blind, gender) %>% 
  summarise(m=median(income)) %>% 
  group_by(blind) %>% 
  summarise(diff(m)) %>% 
  set_names(c("Vision", "Median pay gap (£)")) %>% 
  pander::pander()
Vision Median pay gap (£)
Blind 1735
Normally sighted 8222

The moral here is that different summaries of the data tell different stories: We should explore our data thoroughly, and avoid jumping to conclusions!

Check your knowledge

  1. What does .Rmd stand for (what type of file is it)?
  2. Give three examples of visual ‘aesthetics’ in which can be adjusted in ggplot
  3. Which is correct: labs(x=VARIABLE_NAME) or labs(x="Variable name"). Why is this?
  4. What symbol do we use to add a geom_point() layer to a plot? Is it %>% or +?
  5. What value do we need to set for a geom_hline() layer when adding it to a plot?
  6. Explain an advantage of using a log-scale plot for income data? Give an example of another type of data, common in psychology, where a log scale might also be useful.
  7. What sort of plots would allow us to check if a distribution is skewed?
  8. What effect does scale_y_log10 have on large vs small values?
  9. What is a facet?
  10. When might we use facets in our work?
  11. What is the difference between facet_wrap and facet_grid?
  12. When choosing variables to use as facets or the x-axis, what principle should guide us?