Getting started

Ben Whalley, Paul Sharpe, Sonja Heintz, Andy Wills

Overview

This session doesn’t assume any prior knowledge of R, and introduces the basics. For BSc students this will include some revision of material from stage 1. However, we also provide additional explanation and extension material so that students can test their knowledge and extend familiar skills. We find most students benefit from refreshing their knowledge at this stage in the course.

Even if you are quite confident when using RStudio please read the worksheets carefully and complete all of the activities in the blue boxes.

Using the RStudio interface

  • Access RStudio at https://psyrstudio.plymouth.ac.uk
  • SSO stands for “Single Sign On”
  • Use your standard University login details
  • Use the latest version of the Chrome web browser, or Firefox
  • Tell R what to do in the Console pane
  • See the Environment pane for stored data
  • Use the Files pane to open code and data from a folder on the server
# in the video I enter this simple calculation in the console and R computes the result
2 + 2

The following is not a verbatim transcript of the video, but summarises the main points covered:

To access R we use a web browser to go to the RStudio Server at Plymouth University.

This gives us access to the R software, without having to install it on our own computer.

If you’re using Windows or an older Mac we strongly recommend downloading Chrome or Firefox and using that. If you have any issues with RStudio this is likely the first suggestion we will make.


When you log in to RStudio, you’ll be greeted with a screen that looks something like the image below.

RStudio on first opening

You can see three parts:

  1. The Console - This is the large rectangle on the left. It is where you tell R what to do, and where R prints the answers to your questions.

  2. The Environment - This is the rectangle on the top right. It is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.

  3. The Files - This is the rectangle on the bottom right. It’s a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.

You should also be able to see that the two rectangles on the right have a number of other “tabs”. These work like tabs on a web browser.

The top rectangle has the tabs Environment and History. The History tab keeps a record of what you’ve recently typed into the Console. This can sometimes be useful.

The bottom rectangle has the tabs Files, Plots, Packages, Help, and Viewer. We’ll cover what these other tabs do later on.

Before you start

  • Before starting you must run some R code to get set up.
  • See the code tab or the exercise below and follow the instructions.
# run this to get copies of all the exercises in your account
lifesavr::setup()

To get everyone off to the same start we have created an R script that copies some files into your home folder on the RStudio server, and sets up some preferences for you.

To run this script, we just copy and paste the following line into the Console:

lifesavr::setup()

You will see various messages about copying files.

R will then change the directory shown in the Files pane. You will see a list of “workbooks”. These are documents we have created for you to use to record your work during the course.

  1. Click on the Console pane.
  2. Copy-paste the following into the console:

lifesavr::setup()

Your console should now look like this:

Press ↩︎ to run the code. If your console looks like the image below, then you are ready to start the session.

What can R do?

  • R is a multi-purpose tool

  • It is a text-based computer language

  • RStudio is an interface to that language; it organises your work and makes R easier to use

  • R can do simple arithmetic, load data, make plots etc.

  • It can also run any statistical analysis you like

  • You need to tell R exactly what to do, by providing precise instructions

  • These instructions (code) provide a reproducible record of your work

# multiply two numbers
2 * 221

# generate some random numbers with a normal distribution
rnorm(10, 0, 1)

# histogram plot of random numbers
hist(rnorm(100, 0, 1))

R is a computer language for data analysis and visualisation.

RStudio is a user interface to R; it helps you organise your work.

R is a text-based language. You interact with it by typing commands and running (also called ‘executing’) them.

R can do everything from simple arithmetic and plotting to complex data analysis.

For example, you can do simple arithmetic like

2 * 221

Or, we could generate some random numbers with a normal distribution

rnorm(10, 0, 1)
 [1] -1.1927760 -0.7720570 -1.4673709  0.3584614  0.1584155 -1.0289805  0.5357546
 [8] -0.6115973  0.4914359 -0.4999214

And we could plot random numbers using a histogram

hist(rnorm(100, 0, 1))


You should think of R as a robot.

The robot is extremely fast, powerful and tireless; but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.

Aids reproducibility

The advantage of writing detailed instructions is that you have a detailed, reproducible version of all your analyses.

Reproducibility is a key topic in psychology and other natural sciences — learning R (or something like it) is an important skill for new psychologists.
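
As a small illustration of what reproducible means in practice, here is a sketch that is not part of the original video. It uses base R's set.seed() so that the "random" numbers from rnorm() come out the same every time the code is run. The seed value 1234 and the names first_run and second_run are arbitrary choices made for this example; the <- symbol simply stores a result under a name.

```r
# set.seed() fixes the starting point of R's random number generator.
# The value 1234 is arbitrary; any whole number would do.
set.seed(1234)
first_run <- rnorm(5, 0, 1)

# resetting the same seed and re-running gives exactly the same numbers
set.seed(1234)
second_run <- rnorm(5, 0, 1)

# TRUE means the two sets of numbers match exactly
identical(first_run, second_run)
```

Anyone who runs these lines with the same seed should get the same numbers, which is the sense in which code provides a reproducible record.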

Introducing RMarkdown

  • RMarkdown documents combine ‘chunks’ of R code with regular text
  • This means you can keep notes and explanations right next to where you process and analyse data
  • RMarkdown files end with “.rmd” or “.Rmd”
  • Make a new chunk: Ctrl + Alt + I or ⌘ + ⌥ + I (Mac)
  • Run a line of code: Ctrl + Enter or Cmd + Enter (Mac)
  • You can select part of a statement and run that in the same way

Finding backticks:

# some arithmetic in R used to show how
# we can run all or part of a statement

2 + 2 + 8 + 16

42 * 42

2 +
  4 +
  8 +
  16

RMarkdown documents are a good way to use R and RStudio. They help us organise our code alongside explanatory text and notes.

We can work interactively with multiple analyses at once on screen, and do things like comparing different ways of summarizing the same data or different graphs of the same data.

(At this point, it’s probably best to watch the video for a demo of what RMarkdown can do.)

Code chunks

RMarkdown makes a distinction between R code and regular text.

Markers in the document tell RStudio how to treat each part of your work — whether to display it as text, or format it as code (and give us warnings when there are errors and so on).

To mark something as code, we put it inside some special characters, called backticks, creating a code chunk.

A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio:

A code chunk in the RMarkdown editor

NOTE: The symbols which start and end a chunk are backticks, not single quotes. The difference is quite subtle.

Backticks are on your keyboard here if you’re on Windows:

On windows

Or here if you’re on a Mac:

On a Mac

Running R code inside chunks

In the picture of a chunk above you might have noticed that we had both some code (2+2) and the result from that code was shown beneath the chunk (4!). To show the result of our code, we first need to run it.

There are three ways to run R code within a chunk:

  • Run a whole chunk at once (can include multiple statements)
  • Run one or more related lines, called a statement
  • Select and run just part of a statement

Running a statement

The most common case is when we have given R an instruction (written a statement) and we want to run just that new code.

A statement is sometimes one line of code, but it can include multiple lines that are related — we’ll see more of that later.

For now, you can see we have a code chunk here:

# some arithmetic in R

2 + 4 + 8

In the video we:

  1. Show cursor interacting with this code chunk
  2. Click anywhere on the line (this moves your cursor there, highlighting it)
  3. Run the statement using Ctrl + Enter

You can see that we put the cursor in the middle of this line of numbers that we wanted to add up. Then we pressed Ctrl + Enter (Cmd + Enter on a Mac).

This runs or executes the whole line, and the result is shown below the code chunk.

Code chunks with multiple statements

We might want to add some extra calculations to our chunk.

# some arithmetic in R

2 + 2 + 8 + 16

42 * 42

2 +
  4 +
  8 +
  16

Now the chunk contains several statements, including 2 + 2 + 8 + 16 and 42 * 42 (the star means multiply). The last statement, 2 + 4 + 8 + 16, is spread across four lines. We can run any of the statements in the same way: by putting the cursor anywhere in the statement and pressing Ctrl/Cmd + Enter.

In the video we:

  • Demonstrate doing this in RStudio, showing that the cursor can be anywhere on the line
  • Show that if we execute a second statement the result of the first one disappears
  • Point out that the statements are separated by blank lines to make them easier to read (R doesn’t care about this, and can work out by itself whether lines are related)
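
One way to see how R decides where a statement ends, offered as a sketch rather than something shown in the video: a line that ends with + is incomplete, so R keeps reading onto the next line, whereas a line that is already complete ends the statement. The names total and short are made up for this example.

```r
# each line ends with +, so R treats all four lines as one statement
total <- 2 +
  4 +
  8 +
  16
total   # 30

# here the first line is already complete, so R stops there;
# the second line is run as a separate statement (it just prints 4)
short <- 2
+ 4
short   # 2
```

This is why, in multi-line statements, the + goes at the end of a line rather than at the start of the next one.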

Running the whole chunk

To run the whole chunk at once you can either:

  • press the green “play” button at the top of the chunk
  • press Ctrl+Shift+Enter (Cmd+Shift+Enter on a Mac)

Using the workbooks

  • Each session has an associated “workbook” file
  • They end with the file extension ".rmd"
  • These were copied to your home folder by the setup script (above)
  • Use them to complete the exercises in the worksheet
# No code shown in this video

Each session has an associated “workbook” file which you will use to complete the exercises in the worksheet. The file you need for this session is called session-1-workbook.rmd.

If you click on the file it opens the workbook in a tab of a new pane, called the Source pane. It’s called the Source pane because statements written in the R language are often referred to as ‘R code’, which is shorthand for ‘R source code’. The Source pane allows you to write R code and explore your data.

Exercise 1

Click on session-1-workbook.rmd in the Files pane to open it

  1. Locate the first code chunk
  2. Place your cursor (anywhere) on the line of R code.
  3. Run the code by pressing Ctrl + ↩︎ (Windows, Linux) or ⌘ + ↩︎ (Mac).

You should see the result of the sum appear below the chunk, something like this (although the colours will be different):

10 + 12 + 20
[1] 42

Congratulations! You have just run your first line of R. You can also run part of a line by highlighting just the code you want to run, as you’ll see in the next exercise.

Exercise 2

  1. Select (highlight) the last two numbers in the sum.
  2. Run the code.

This adds up two of the three numbers:

Example of running highlighted code

Exercise 3: Making new chunks

  1. Find the instructions for Exercise 3 in your workbook.
  2. Create a new chunk below the instructions.
  3. Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2.
  4. Run the line of code you have written.

The output from the chunk should look like this:

Result from Exercise 3

  5. If you typed the backticks to make the code chunk by hand, try making a new chunk using the keyboard shortcut. On a PC that’s Ctrl + Alt + I and on a Mac it’s Command + Option + I.

Packages

  • Loading a package adds functionality to R
  • Some packages (like tidyverse and psydata) also include example datasets
  • To load tidyverse write library(tidyverse)
  • Load tidyverse and psydata before each session
# load the tidyverse package which includes the ggplot function
# (this also loads the diamonds example dataset, and some others)
library(tidyverse)
# make a plot using new functions available to us
diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point()

By loading ‘packages’, you can add extra functions and datasets to R.

Packages are a powerful feature which allow R to be extended. This means you can run almost any analysis, or make any type of plot.

Packages are loaded using the library() function.

The command library(tidyverse) loads some additional functions and data which will allow us to make a scatter plot.

The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R code, in the first chunk, in all your RMarkdown files.

It’s a good idea to start your documents with a chunk which loads any packages you need. This makes it easy to see which have been loaded, and avoids loading them twice which is occasionally a problem.

You also need to remember to actually run the lines of code to load the libraries. Beginners often forget to do this — but it’s an easy error to fix.

  • Demonstrate error when running following plot if tidyverse hasn’t been loaded
library(tidyverse)
diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point()
  1. Load tidyverse
  2. Re-run example plot to show that it now works

If you’ve understood what packages are it should be clear you need to load them first, before doing anything else.

You can’t use the functions provided by tidyverse until you’ve run the command: library(tidyverse). And the data in psydata is not available until after you load that package.

For example, if you tried to produce a scatter plot before loading tidyverse you’d see an error like this in the console pane:

Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) :
  could not find function "%>%"

This is important to remember: could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have

  1. forgotten to include library(tidyverse) as the first line in your code, or
  2. forgotten to run that line.
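
To see what this kind of error looks like without breaking anything, here is a minimal sketch in base R. The function name not_a_real_function is deliberately made up, and tryCatch() is used only so that the error message can be captured and displayed rather than stopping the code.

```r
# not_a_real_function() is a hypothetical name that no package provides,
# so R cannot find it, just as it cannot find ggplot() before
# library(tidyverse) has been run
msg <- tryCatch(
  not_a_real_function(1),
  error = function(e) conditionMessage(e)
)

# display the captured error message
msg
```

The message names the function R could not find, which is usually a good clue about which package still needs loading.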

Datasets

Datasets are like spreadsheets. They have:

  • multiple rows, with one row per observation
  • multiple columns; each column has a name.
  • columns also (sometimes) get called variables; this can be confusing

Where are datasets?

  • R has some built-in datasets as learning examples
  • The psydata package includes datasets used in this course
  • Later on, we will import data from files (e.g. actual spreadsheets)

Exploring and checking data

  • View a whole dataset by typing its name and running it in a code chunk
  • glimpse() shows a list of all the columns, plus a few of the datapoints
  • The Environment pane shows a spreadsheet-like view of the data
# always load the tidyverse first
library(tidyverse)

# the psydata package contains datasets for this course
library(psydata)

# display the `fuel` dataset, by typing the name
# and running this in a code chunk
fuel

# show only the first 6 rows of the `fuel` data
head(fuel)

# shows a list of columns in the `development` dataset
# plus the first few datapoints (as many as will fit)
glimpse(development)

When we say “dataset” or “data”, we mean something like a spreadsheet: in R, datasets contain values organised into columns and rows.

One distinction to make, though, is that by dataset we mean data that has been loaded into R; data files are a different thing.

In R, datasets are normally stored in a container called a data.frame. They can also be stored in a tibble (these are basically the same thing).
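
To make the idea of rows and columns concrete, the sketch below builds a tiny data.frame by hand using base R. The dataset name, column names and values are all invented for illustration and are not part of any course dataset.

```r
# a hypothetical dataset: one row per participant, one column per variable
reaction_times <- data.frame(
  participant = c(1, 2, 3),
  condition   = c("control", "treatment", "control"),
  rt_ms       = c(512, 478, 530)
)

reaction_times          # display the whole dataset
nrow(reaction_times)    # number of rows (observations): 3
ncol(reaction_times)    # number of columns: 3
names(reaction_times)   # the column names
```

In this course you will mostly use ready-made datasets rather than typing data in like this, but the structure is the same.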

Packaged datasets

Some datasets are built into R packages as examples for beginners.

For this course, we created a package called psydata which includes the data we need for teaching.

This is installed on the RStudio server. To load it we run:

library(psydata)

We can see from the loading message that one of the datasets is called fuel. This contains data about cars — things like weight, fuel economy, engine size.

Let’s display this data using a new chunk. If we type the word fuel, select this variable name with our cursor, and ‘execute’ it, we can see the data it contains:

fuel

By default this shows the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data.

If your browser window is very narrow you may need to view some of the columns by using the arrow next to the final, right-hand column.

You can get information about the columns in all these example datasets by typing: help(name_of_the_dataset_you_want_to_know_about). For example:

help(fuel)

Columns

Each column in a dataset has a name.

We sometimes call the columns variables, because each column will often relate to a variable in our study.

However, this can be a bit confusing because — in R — variables can actually contain whole datasets. fuel, for example, is the name of a variable which contains an example dataset, provided by the psydata package.

  • Show library(psydata) and then the fuel dataset

But these words are used flexibly and interchangeably, so we’ll just have to get used to it. It’s normally clear which type of variable we mean from the context.

Rows

Each row in a dataset represents an observation.

In different datasets an observation might correspond to an individual participant, a whole country, or even just a single button press in an experiment.

Exploring and checking data

There are two ways we recommend to inspect and check the data you are using.

  1. Typing the name of the dataset, and running that as code
  2. The glimpse() function, which shows a list of all the columns and some of the data

To use glimpse:

fuel
glimpse(fuel)
Rows: 32
Columns: 7
$ mpg         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
$ cyl         <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 27…
$ power       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
$ weight      <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 15…
$ gear        <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
$ automatic   <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

glimpse shows a list of all the columns in the dataset, the type of data stored in each column, and as many observations (datapoints) as will fit on a single line.

glimpse is a really useful view to check which columns are available in a dataset before using them.
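
If you want to try the same kind of check on data that does not need psydata, the sketch below uses mtcars, a dataset built into R. It assumes the tidyverse is installed (as it is on the course server). The functions names(), str() and dim() are base R functions that give a similar overview; they are shown here as alternatives, not as something the course requires.

```r
library(tidyverse)

# glimpse() lists every column, its type, and the first few values
glimpse(mtcars)

# base R alternatives that give similar information
names(mtcars)   # just the column names
str(mtcars)     # column names, types and first few values
dim(mtcars)     # number of rows, then number of columns: 32 11
```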

Why are we talking about cars and flowers and not psychology?

In this course we mostly use very simple datasets, and some of them aren’t even about psychology.

Some students ask why we don’t always use psychological examples. If this hasn’t troubled you then you could skip to the next section, but we thought we should explain:

We think the fuel dataset (and others, like iris, and development) have a number of benefits.

First, they are either built into R, loaded in common packages, or available in the psydata package. This makes them easily available for everyone.

Second, these data relate to concrete, easy to understand phenomena (e.g. weight, length, number of gears). This means you don’t have to hold in mind any complex psychological/theoretical ideas for the examples to make sense.

Third, the relationships in these datasets are clear, and there aren’t too many data points. Real data are often more messy because many psychological constructs are hard to measure.

Our experience is that, when learning R, it pays to keep everything as simple as it possibly can be. The skills and concepts involved in analysing these data are the same though.

R, and the techniques and statistics we teach, are used right across the natural sciences.

If you’re still not convinced — don’t worry … we do include some clinical examples, and we will be collecting our own psychological data soon enough and analysing that.

Exercise 4

  1. Open your workbook for this workshop (called session-1.rmd).
  2. Create a new chunk below the Exercise 4 instructions.
  3. Load the psydata package.
  4. Look at the fuel data using the glimpse() function.
  5. Display the fuel dataset and try out the navigation buttons.
  6. Write a line of code which makes a list of columns in the development dataset.

The output should look something like this:

The fuel dataset
Columns in the development dataset

Exercise 5

In your workbook (session-1.rmd):

  1. Create a new chunk below the Exercise 5 instructions.
  2. Load the psydata package (if you haven’t already).
  3. Show the first 6 rows of the development data using head().

Use the output to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.

The population of Afghanistan in 1967 was: .

Take a break!

Our student pilot-testers suggested that now would be a good time for a short break!

Scatterplots

  • A scatterplot shows the relationship between two continuous variables (columns)
  • Each observation (row) must have at least two values (so we need two columns)
  • These define the position of a point on the x and y axes of the plot
  • Use ggplot()
  • aes(x = ..., y = ...) chooses the x and y data columns and creates the axes
  • geom_point() adds the points
# if you have not already, load these packages
library(tidyverse)
library(psydata)

# make a scatterplot from the fuel dataset
fuel %>%
  ggplot(aes(x=weight, y=mpg)) + # selects the columns to use
    geom_point()                 # adds the points to the plot

# the same plot, this time we left out x= and y= in
# the aes code. These are implicit from the order of weight and mpg
# (the x-axis comes first)
fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

A scatterplot shows the relationship between two variables by plotting their values as points on the x-axis (the left-right position) and the y-axis (up-down).

This code chunk creates a scatterplot using the fuel dataset.

  • Create chunk and type code
library(tidyverse)
library(psydata)
fuel %>%
  ggplot(aes(x = weight, y = mpg)) +
    geom_point()

  • select the pipe with the cursor to highlight

The %>% symbol is special: it’s called a ‘pipe’. We’ll cover the pipe in a later session, but for now you just need to know that it sends the fuel data on to the next line of code, as if it were passing it down a pipe.

The second line receives the data. The ggplot() function means we’re making a plot.
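
As a rough illustration of what "sending the data on" means, the sketch below compares a piped call with the equivalent ordinary call. It uses head() and the built-in mtcars data rather than fuel, so that it runs without psydata; the names piped and direct are made up for this example.

```r
library(tidyverse)

# the pipe passes mtcars to head() as its first argument
piped <- mtcars %>%
  head(3)

# the same thing written without the pipe
direct <- head(mtcars, 3)

# TRUE means both versions produced exactly the same rows
identical(piped, direct)
```

In the scatterplot code the same thing happens: the dataset on the first line becomes the data that ggplot() works with.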

The plot is built in two steps:

The first step

  • select ggplot(aes(weight, mpg))

selects columns in our dataset to use for the x and y axes. In this case, the x-axis is weight, which is the weight of the cars in kg. And mpg is miles per gallon, or fuel efficiency. This will be the y-axis.

  • select weight in the code and highlight that it is the x-axis
  • same for mpg and y-axis

We can see the plot if we run this statement using the keyboard shortcut — Ctrl or Cmd (Mac) + Enter

  • run the code and show the resulting plot
  • emphasise this is shown below the code chunk when using RMarkdown

Building plots in layers

A useful thing to know is that ggplot works by building up plots in multiple layers.

If we run just this part of the code, we can see the plot with just the axes, and no data shown.

  1. run just the first two lines of code by selecting them and pressing Ctrl + Enter
  2. emphasise the axes are there but no data shown
  3. rerun all code to plot points

So, conceptually, we make plots by:

  1. selecting data
  2. defining the axes, and then
  3. drawing the data points

Each part of the plot is separated by a + symbol and goes on a new line.

RStudio is smart and knows all this is part of the same statement. This means it automatically indents the code.
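
The layering idea can be sketched by saving the first steps under a name and adding the points afterwards. This example uses the built-in mtcars data (where wt is weight and mpg is miles per gallon) so that it runs without psydata; the names axes_only and with_points are made up for this example.

```r
library(tidyverse)

# steps 1 and 2: select the data and define the axes
axes_only <- mtcars %>%
  ggplot(aes(wt, mpg))

axes_only                 # shows empty axes, no points

# step 3: add a layer which draws the data points
with_points <- axes_only +
  geom_point()

with_points               # shows the finished scatterplot
```

You don't need to save plots under names like this in the exercises; it is done here only to make each layer visible on its own.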

Cutting corners

There’s just one final thing to explain: In the previous code we wrote x = weight and y = mpg.

This makes things explicit, which is nice, but takes longer to type. You can also write the plot this way:

fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

R assumes that the first column is the x-axis and the second is the y-axis.

  • select x-axis and y-axis in turn when describing

We normally drop the x = and y = in these guides, and you should too.

The aes() function, used with ggplot, is short for ‘aesthetics’ (what we can see). This part of the code connects columns in our data to visual aspects of a plot, like the x and y axes.
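
A quick way to convince yourself that the two ways of writing aes() are equivalent is to look at what each one produces. This is a sketch using the column names from the fuel example; aes() only records the column names, so it runs even if the fuel data has not been loaded. The names explicit and implicit are made up for this example.

```r
library(tidyverse)

explicit <- aes(x = weight, y = mpg)
implicit <- aes(weight, mpg)

# both versions map the first column to x and the second to y
names(explicit)   # "x" "y"
names(implicit)   # "x" "y"
```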

Exercise 6

  1. Create a new chunk below the Exercise 6 instructions in your workbook.
  2. Using the fuel dataset, create a scatterplot with engine_size on the x-axis and mpg (miles per gallon, or fuel economy) on the y-axis.
  3. Run the chunk.

The scatterplot should look like this:

Working interactively

  • A statement is one or more lines of code which R knows are linked
  • Run a statement: Ctrl + ↵ (or ⌘ + ↩︎ Mac)
  • Run the whole chunk: Shift + Ctrl + ↵ (or Shift + ⌘ + ↩︎ Mac)
# No new code shown in this video

In an earlier video we introduced RMarkdown. In this clip, we recap on some useful things to know when using RMarkdown files in RStudio. It’s intended as a kind of reference for later on.

  • Running just part of a statement
  • Run a whole chunk using Cmd+Shift+Enter or Ctrl+Shift+Enter (Windows)
  • How R knows where statements start and end
  • The importance of blank lines
  • Loading packages first

Running parts of statements:

happy %>%
  filter(year==2018) %>%
  ggplot(aes(gdp_per_capita, happiness)) +
  geom_point()
Warning: Removed 6 rows containing missing values (`geom_point()`).

  • Demonstrate selecting each part in turn again
  • Point out that we can check what data has been sent to the plot and amend if necessary
  • Demonstrate making an error where column is misspelled
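
The same check-as-you-go approach can be sketched with the built-in mtcars data, so that it runs without the happy dataset. Running only the first statement shows which rows would be sent on to the plot. The name four_cyl is made up for this example, and cyl == 4 is an arbitrary filter chosen for illustration.

```r
library(tidyverse)

# run only this statement to check the data before plotting
four_cyl <- mtcars %>%
  filter(cyl == 4)

nrow(four_cyl)   # 11 cars have 4 cylinders

# once the data look right, send them on to the plot
four_cyl %>%
  ggplot(aes(wt, mpg)) +
  geom_point()
```

If a column name were misspelled in filter() or aes(), running the statement one part at a time like this makes it easier to see which step produced the error.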

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 2.

  1. How do you run part of a line of R code using the keyboard shortcut?
  2. Which library will you always need to load in your first RMarkdown chunk?
  3. What is psydata?
  4. How would you look at/inspect a whole dataset?
  5. What does glimpse() do and when is it useful?
  6. What is the 5th column in the development dataset?
  7. Which function makes a plot? (there are many, but we mean the one shown above)
  8. Which function chooses the columns of data used in the plot?

Extension exercises

Please remember that these extension exercises are not required to pass the course. We include them because some students work through these materials much more quickly than others — often because they have more previous experience with programming — and we aim to give all students the opportunity to stretch their skills.

If you do find you have extra time these exercises are intended to provide additional practice in the techniques covered and to be useful preparation for using R independently in a stage 4 or MSc research project.

Extension exercise 1

This scatterplot uses the fuel dataset to show a vehicle’s power on the x-axis against mpg on the y-axis.

In a new chunk, write the R code to produce this plot.

Extension exercise 2

There is another built-in dataset called iris which includes data about different flower species.

Use glimpse() to get a list of the column names.

Make a scatterplot which shows the relationships between petal widths and lengths.

Reading

Strongly suggested:

Extended material: