Overview
This session doesn’t assume any prior knowledge of R, and introduces the basics. For BSc students this will include some revision of material from stage 1. However, we provide additional explanation and extension material for students to test their knowledge and extend familiar skills. We find most students benefit from refreshing their knowledge at this stage in the course.
Even if you are quite confident when using RStudio please read the worksheets carefully and complete all of the activities in the blue boxes.
Techniques covered today
Using the RStudio interface
- Access RStudio at https://psyrstudio.plymouth.ac.uk
- SSO stands for “Single Sign On”
- Use your standard University login details
- Use the latest version of the Chrome web browser, or Firefox
- Tell R what to do in the Console pane
- See the Environment pane for stored data
- Use the Files pane to open code and data from a folder on the server
The following is not a verbatim transcript of the video, but summarises the main points covered:
To access R we use a web browser to go to the RStudio Server at Plymouth University.
This gives us access to the R software, without having to install it on our own computer.
If you’re using Windows or an older Mac we strongly recommend downloading Chrome or Firefox and using that. If you have any issues with RStudio this is likely the first suggestion we will make.
When you log in to RStudio, you’ll be greeted with a screen that looks something like the image below.
You can see three parts:
The Console - This is the large rectangle on the left. It is where you tell R what to do, and where R prints the answers to your questions.
The Environment - This is the rectangle on the top right. It is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.
The Files - This is the rectangle on the bottom right. It’s a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.
You should also be able to see that the two rectangles on the right have a number of other “tabs”. These work like tabs on a web browser.
The top rectangle has the tabs Environment and History. The History tab keeps a record of what you’ve recently typed into the Console. This can sometimes be useful.
The bottom rectangle has the tabs Files, Plots, Packages, Help, and Viewer. We’ll cover what these other tabs do later on.
Before you start
- Before starting you must run some R code to get set up.
- See the code tab or the exercise below and follow the instructions.
To get everyone off to the same start we have created an R script that copies some files into your home folder on the RStudio server, and sets up some preferences for you.
To run this script, we copy and paste a single line of code into the Console (the line itself is given in the steps below).
You will see various messages about copying files.
R will then change the directory shown in the “files” pane. You will see a list of “workbooks”. These are documents we have created for you to use to record your work during the course.
- Click on the Console pane.
- Copy-paste the following into the console:
lifesavr::setup()
Your console should now look like this:
Press ↩︎ to run the code. If your console looks like the image below, then you are ready to start the session.
What can R do?
R is a multi-purpose tool
It is a text-based computer language
RStudio is an interface to that language; it organises your work and makes R easier to use
R can do simple arithmetic, load data, make plots etc.
It can also run any statistical analysis you like
You need to tell R exactly what to do, by providing precise instructions
These instructions (code) provide a reproducible record of your work
R is a computer language for data analysis and visualisation.
RStudio is a user interface to R; it helps you organise your work.
R is a text-based language. You interact with it by typing commands and running (also called ‘executing’) them.
R can do everything from simple arithmetic and plotting to complex data analysis.
For example, you can do simple arithmetic like
Or, we could generate some random numbers with a normal distribution
[1] 0.4065467 0.9944206 0.8557684 0.1971289 0.8343250 0.8467902 1.9541053 -2.1492600 0.9711203
[10] 1.1450616
And we could plot random numbers using a histogram
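The three examples above can be sketched as follows. This is a minimal illustration rather than the exact code from the session: the particular sum, the use of rnorm() to draw ten values (matching the ten numbers printed above), and the choice of 1000 values for the histogram are all assumptions.

```r
# simple arithmetic: R prints the answer in the console
2 + 2

# ten random numbers drawn from a normal distribution
# (your numbers will differ each time you run this)
rnorm(10)

# a histogram of 1000 random numbers from a normal distribution
hist(rnorm(1000))
```

Because the numbers are random, the values and the exact shape of the histogram change every time the code is run.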
You should think of R as a robot.
The robot is extremely fast, powerful and tireless; but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.
Aids reproducibility
The advantage of writing detailed instructions is that you have a detailed, reproducible version of all your analyses.
Reproducibility is a key topic in psychology and other natural sciences — learning R (or something like it) is an important skill for new psychologists.
Introducing RMarkdown
- RMarkdown documents combine ‘chunks’ of R code with regular text
- This means you can keep notes and explanations right next to where you process and analyse data
- RMarkdown files end with “.rmd” or “.Rmd”
- Make a new chunk: Ctrl + Alt + i or ⌘ + ⌥ + i (Mac)
- Run a line of code: Ctrl + Enter or Cmd + Enter (Mac)
- You can select part of a statement and run that in the same way
- On Windows, if you get an accented í when using the Ctrl + Alt + i shortcut, try to switch your keyboard to “English (US)”: go to the Windows settings > Time & Language > Language > Preferred Languages > Add a language > English (United States) > Next > Install. Once installation finishes, press the shortcut Win + space to switch to the newly installed ‘English (United States)’ keyboard, and try the Ctrl + Alt + i shortcut again. If this doesn’t work, you will have to enter the code chunks manually.
Finding backticks:
Rmarkdown documents are a good way to use R and RStudio. They help us organize our code alongside explanatory text and notes.
We can work interactively with multiple analyses at once on screen, and do things like comparing different ways of summarizing the same data or different graphs of the same data.
(At this point, it’s probably best to watch the video for a demo of what RMarkdown can do.)
Code chunks
RMarkdown makes a distinction between R code and regular text.
Markers in the document tell RStudio how to treat each part of your work — whether to display it as text, or format it as code (and give us warnings when there are errors and so on).
To mark something as code, we put it inside some special characters, called backticks, creating a code chunk.
A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio:
NOTE: The symbols which start and end a chunk are backticks, not single quotes. The difference is quite subtle.
Backticks are on your keyboard here if you’re on Windows:
Or here if you’re on a Mac:
Running R code inside chunks
In the picture of a chunk above you might have noticed that we had both some code (2+2) and the result from that code, shown beneath the chunk (4!). To show the result of our code, we first need to run it.
There are three ways to run R code within a chunk:
- Run a whole chunk at once (can include multiple statements)
- Run one or more related lines, called a statement
- Select and run just part of a statement
Running a statement
The most common case is when we have given R an instruction (written a statement) and we want to run just that new code.
A statement is sometimes one line of code, but it can include multiple lines that are related — we’ll see more of that later.
For now, you can see we have a code chunk here:
In the video we:
- Show cursor interacting with this code chunk
- Click anywhere on the line (this moves your cursor there, highlighting it)
- Run the statement using Ctrl + Enter
You can see that we put the cursor in the middle of this line of numbers that we wanted to add up. Then we pressed Ctrl + ↵ (this would be the Cmd key on a Mac).
This runs or executes the whole line, and the result is shown below the code chunk.
Code chunks with multiple statements
We might want to add some extra calculations to our chunk.
Now we have 2 statements: 2 + 4 + 8 + 16 and 42 * 42 (the star means multiply). We can run either of the statements in the same way: by putting the cursor anywhere on the line and pressing Ctrl/Cmd + Enter.
In the video we:
- Demonstrate doing this in RStudio and show that the cursor can be anywhere on the line
- Also, that if we execute a second statement the result of the first one disappears.
- Also we show that if we separate all the statements with a blank line they are easier to read (but that R doesn’t care about this, and can work out whether the lines are related).
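A chunk containing the two statements described above might look like the following. This is a sketch based on the statements quoted in the text; the blank line and the comments are additions for readability.

```r
# first statement: add up four numbers
2 + 4 + 8 + 16

# second statement: the star means multiply
42 * 42
```

Running the first statement prints 30, and running the second prints 1764.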
Running the whole chunk
To run the whole chunk at once you can either:
- press the green “play” button at the top of the chunk
- press Ctrl+Shift+Enter (Cmd+Shift+Enter on a Mac)
Using the workbooks
- Each session has an associated “workbook” file
- They end with the file extension ".rmd"
- These were copied to your home folder by the setup script (above)
- Use them to complete the exercises in the worksheet
Each session has an associated “workbook” file which you will use to complete the exercises in the worksheet. The file you need for this session is called session-1-workbook.rmd.
If you click on the file it opens the workbook in a tab of a new pane, called the Source pane. It’s called the Source pane because statements written in the R language are often referred to as ‘R code’, which is shorthand for ‘R source code’. The source pane allows you to write R code and explore your data.
Remember to save your workbooks regularly to avoid losing work.
Exercise 1
- Click on session-1-workbook.rmd in the Files pane to open it
- Locate the first code chunk
- Place your cursor (anywhere) on the line of R code.
- Run the code by pressing Ctrl + ↵ (Windows, Linux) or ⌘ + ↩︎ (Mac).
You should see the result of the sum appear below the chunk, something like this (although the colours will be different):
[1] 42
Congratulations! You have just run your first line of R. You can also run part of a line by highlighting just the code you want to run, as you’ll see in the next exercise.
Exercise 2
- Select (highlight) the last two numbers in the sum.
- Run the code.
This adds up two of the three numbers:
Exercise 3: Making new chunks
- Find the instructions for Exercise 3 in your workbook.
- Create a new chunk below the instructions.
- Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2.
- Run the line of code you have written.
The output from the chunk should look like this:
- If you typed the backticks to make the code chunk by hand, try making a new chunk using the keyboard shortcut. On a PC that’s Ctrl + Alt + I and on a Mac it’s Command + Option + I.
Packages
- Loading a package adds functionality to R
- Some packages (like tidyverse and psydata) also include example datasets
- To load tidyverse write library(tidyverse)
- Load tidyverse and psydata before each session
By loading ‘packages’, you can add extra functions and datasets to R.
Packages are a powerful feature which allow R to be extended. This means you can run almost any analysis, or make any type of plot.
Packages are loaded using the library() function.
The command library(tidyverse) loads some additional functions and data which will allow us to make a scatter plot.
The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R code, in the first chunk, in all your RMarkdown files.
It’s a good idea to start your documents with a chunk which loads any packages you need. This makes it easy to see which have been loaded, and avoids loading them twice which is occasionally a problem.
You also need to remember to actually run the lines of code to load the libraries. Beginners often forget to do this — but it’s an easy error to fix.
In the video we demonstrate an error when running plot code if tidyverse hasn’t been loaded.
In the video we:
- Load tidyverse
- Re-run the example plot to show that it now works.
If you’ve understood what packages are it should be clear you need to load them first, before doing anything else.
You can’t use the functions provided by tidyverse until you’ve run the command: library(tidyverse). And the data in psydata is not available until after you load that package.
For example, if you tried to produce a scatter plot before loading tidyverse you’d see an error like this in the console pane:
Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) : could not find function "%>%"
This is important to remember: could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have
- forgotten to include library(tidyverse) as the first line in your code, or
- forgotten to run that line.
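As an illustration, the chunk below loads tidyverse before running the plot named in the error message above. Treat it as a sketch: the geom_point() layer is an assumption, because the error message only shows the first part of the plotting code. The diamonds data is an example dataset from the ggplot2 package, which is loaded as part of tidyverse.

```r
# load the tidyverse first, so that %>% and ggplot() are available
library(tidyverse)

# with the package loaded, the plot runs without the
# 'could not find function' error
diamonds %>%
  ggplot(aes(carat, price, colour = clarity)) +
  geom_point()
```

If you still see the error after adding library(tidyverse), check that you have actually run that line.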
Datasets
Datasets are like spreadsheets. They have:
- multiple rows, with one row per observation
- multiple columns; each column has a name.
- columns also (sometimes) get called variables; this can be confusing
Where are datasets?
- R has some built-in datasets as learning examples
- The psydata package includes datasets used in this course
- Later on, we will import data from files (e.g. actual spreadsheets)
Exploring and checking data
- View a whole dataset by typing its name and running it in a code chunk
- glimpse() shows a list of all the columns, plus a few of the datapoints
- The Environment pane shows a spreadsheet-like view of the data
# always load the tidyverse first
library(tidyverse)
# the psydata package contains datasets for this course
library(psydata)
# display the `fuel` dataset, by typing the name
# and running this in a code chunk
fuel
# show only the first 6 rows of the `fuel` data
head(fuel)
# shows a list of columns in the `development` dataset
# plus the first few datapoints (as many as will fit)
glimpse(development)
When we say “dataset” or “data”, we mean something like a spreadsheet: In R, datasets contain values organised into columns and rows.
One distinction to make though is that by dataset we mean data that has been loaded into R; data files are a different thing.
In R, datasets are normally stored in a container called a data.frame. They can also be stored in a tibble (these are basically the same thing).
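One way to check what kind of container a dataset is stored in is to use the base R functions class() and is.data.frame(). The sketch below uses mtcars, a dataset built into R, as a stand-in; that choice is an assumption made so the code runs without loading any packages.

```r
# mtcars is an example dataset built into R
class(mtcars)          # the type of container
is.data.frame(mtcars)  # TRUE if the dataset is a data.frame

# count the rows (observations) and the columns
nrow(mtcars)
ncol(mtcars)
```

A tibble also counts as a data.frame, so is.data.frame() returns TRUE for both kinds of container.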
Packaged datasets
Some datasets are built into R packages as examples for beginners.
For this course, we created a package called psydata which includes the data we need for teaching.
This is installed on the RStudio server. To load it we run:
We can see from the loading message that one of the datasets is called fuel. This contains data about cars: things like weight, fuel economy, engine size.
Let’s display this data using a new chunk. If we type the word fuel, select this variable name with our cursor, and ‘execute’ it, we can see the data it contains:
By default this shows the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data.
If your browser window is very narrow you may need to view some of the columns by using the arrow next to the final, right-hand column.
You can get information about the columns in all these example datasets by typing: help(name_of_the_dataset_you_want_to_know_about). For example:
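Putting the steps in this section together gives something like the chunk below. It is a sketch that assumes the psydata package is installed (as it is on the RStudio server) and that fuel has a help page like the other example datasets.

```r
# load the datasets used in this course
library(psydata)

# display the fuel dataset by running its name
fuel

# open the help page describing the columns in fuel
help(fuel)
```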
Columns
Each column in a dataset has a name.
We sometimes call the columns variables, because each column will often relate to a variable in our study.
However, this can be a bit confusing because, in R, variables can actually contain whole datasets. fuel, for example, is the name of a variable which contains an example dataset, provided by the psydata package.
In the video we show library(psydata) and then the fuel dataset.
But these words are used flexibly and interchangeably, so we’ll just have to get used to it. It’s normally clear which type of variable we mean from the context.
Rows
Each row in a dataset represents an observation.
In different datasets an observation might correspond to an individual participant, a whole country, or even just a single button press in an experiment.
Exploring and checking data
There are two ways we recommend to inspect and check the data you are using:
- Typing the name of the dataset, and running that as code
- The glimpse() function, which shows a list of all the columns and some of the data
To use glimpse:
Rows: 32
Columns: 7
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, …
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 452…
$ power <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230,…
$ weight <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 171…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, …
$ automatic <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
glimpse shows a list of all the columns in the dataset, the type of data stored in each column, and as many observations (datapoints) as will fit on a single line.
glimpse is a really useful view to check which columns are available in a dataset before using them.
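The output above is the kind of result you would expect from code like the following. This is a sketch which assumes tidyverse and psydata are both available, as they are on the RStudio server.

```r
library(tidyverse)  # provides glimpse()
library(psydata)    # provides the fuel dataset

# list the columns in fuel, with their types and first few values
glimpse(fuel)
```

In the output, <dbl> marks a column of numbers and <lgl> marks a column of TRUE/FALSE values.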
Why are we talking about cars and flowers and not psychology?
In this course we mostly use very simple datasets, and some of them aren’t even about psychology.
Some students ask why we don’t always use psychological examples. If this hasn’t troubled you then you could skip to the next section, but we thought we should explain:
We think the fuel dataset (and others, like iris, and development) have a number of benefits.
First, they are either built into R, loaded in common packages, or available in the psydata package. This makes them easily available for everyone.
Second, these data relate to concrete, easy to understand phenomena (e.g. weight, length, number of gears). This means you don’t have to hold in mind any complex psychological/theoretical ideas for the examples to make sense.
Third, the relationships in these datasets are clear, and there aren’t too many data points. Real data are often more messy because many psychological constructs are hard to measure.
Our experience is that, when learning R, it pays to keep everything as simple as it possibly can be. The skills and concepts involved in analysing these data are the same though.
R — and the techniques and statistics we teach — are used right across the natural sciences
If you’re still not convinced — don’t worry … we do include some clinical examples, and we will be collecting our own psychological data soon enough and analysing that.
Exercise 4
- Open your workbook for this workshop (called session-1.rmd).
- Create a new chunk below the Exercise 4 instructions.
- Load the psydata package.
- Look at the fuel data using the glimpse() function.
- Display the fuel dataset and try out the navigation buttons.
- Write a line of code which makes a list of columns in the development dataset.
The output should look something like this:
Exercise 5
In your workbook (session-1.rmd):
- Create a new chunk below the Exercise 5 instructions.
- Load the psydata package (if you haven’t already).
- Show the first 6 rows of the development data using head().
Use the output to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.
The population of Afghanistan in 1967 was: .
Take a break!
Our student pilot-testers suggested that now would be a good time for a short break!
Scatterplots
- A scatterplot shows the relationship between two continuous variables (columns)
- Each observation (row) must have at least two values (so we need two columns)
- These define the position of a point on the x and y axes of the plot
- Use ggplot()
- aes(x = ..., y = ...) chooses the x and y data columns and creates the axes
- geom_point() adds the points
# if you have not already, load these packages
library(tidyverse)
library(psydata)
# make a scatterplot from the fuel dataset
fuel %>%
  ggplot(aes(x=weight, y=mpg)) + # selects the columns to use
  geom_point() # adds the points to the plot
# the same plot, this time we left out x= and y= in
# the aes code. These are implicit from the order of weight and mpg
# (the x-axis comes first)
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point()
A scatterplot shows the relationship between two variables by plotting their values as points on the x-axis (the left-right position) and the y-axis (up-down).
This code chunk creates a scatterplot using the fuel dataset.
In the video we create a new code chunk and type code into it.
The %>% symbol is special; it’s called a ‘pipe’. We’ll cover the pipe in a later session, but for now you just need to know that it sends the fuel data on to the next line of code, like it’s passing it down a pipe.
The second line receives the data. The ggplot() function means we’re making a plot.
The plot is built in two steps:
The first step
In the video we:
- select ggplot(aes(weight, mpg))
This part of the code selects columns in our dataset to use for the x and y axes. In this case, the x-axis is weight, which is the weight of the cars in kg. And mpg is miles per gallon, or fuel efficiency. This will be the y-axis.
In the video we:
- select weight in the code
- highlight it is the x-axis
- same for mpg and y-axis
We can see the plot if we run this statement using the keyboard shortcut: Ctrl or Cmd (Mac) + Enter.
In the video we:
- run the code and show the resulting plot
- emphasise this is shown below the code chunk when using RMarkdown
Building plots in layers
A useful thing to know is that ggplot works by building up plots in multiple layers.
If we run just this part of the code, we can see the plot with just the axes, and no data shown.
In the video we:
- Run just the first two lines of code by selecting and pressing ctrl+enter.
- Emphasise the axes are there, but no data shown yet.
- Rerun all code to plot points for each row in the data.
So, conceptually, we make plots by:
- selecting data
- defining the axes, and then
- drawing the data points
Each part of the plot is separated by a + symbol and goes on a new line.
RStudio is smart and knows all this is part of the same statement. This means it automatically indents the code.
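To see the layers for yourself, you could run the plot with and without its final layer. This sketch reuses the fuel scatterplot from this session and assumes tidyverse and psydata are installed.

```r
library(tidyverse)
library(psydata)

# data and axes only: the plot is drawn, but it contains no points
fuel %>%
  ggplot(aes(weight, mpg))

# adding geom_point() draws one point for each row of the data
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point()
```

The only difference between the two statements is the last layer, which is why the first plot shows empty axes.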
Cutting corners
There’s just one final thing to explain: In the previous code we wrote x = weight and y = mpg.
This makes things explicit, which is nice, but takes longer to type. You can also leave these out and write just aes(weight, mpg).
R assumes that the first column is the x-axis and the second is the y-axis.
We normally drop the x = and y = in these guides, and you should too.
The aes() function, used with ggplot, is short for ‘aesthetics’ (what we can see). This part of the code connects columns in our data to visual aspects of a plot, like the x and y axes.
Exercise 6
- Create a new chunk below the Exercise 6 instructions in your workbook.
- Using the fuel dataset, create a scatterplot with engine_size on the x-axis and mpg (miles per gallon, or fuel economy) on the y-axis.
- Run the chunk.
The scatterplot should look like this:
Working interactively
Note: most of this section is a recap of the content above.
- A statement is one or more lines of code which R knows are linked
- Run a statement: Ctrl + ↵ (or ⌘ + ↩︎ on a Mac)
- Run the whole chunk: Shift + Ctrl + ↵ (or Shift + ⌘ + ↩︎ on a Mac)
In an earlier video we introduced RMarkdown. In this clip, we recap on some useful things to know when using RMarkdown files in RStudio. It’s intended as a kind of reference for later on.
- Running just part of a statement
- Run a whole chunk using Cmd+Shift+Enter or Ctrl+Shift+Enter (Windows)
- How R knows where statements start and end
- The importance of blank lines
- Loading packages first
Running parts of statements:
Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_point()`).
- Demonstrate selecting each part in turn again
- Point out that we can check what data has been sent to the plot and amend if necessary
- Demonstrate making an error where column is misspelled
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 2.
- How do you run part of a line of R code using the keyboard short cut?
- Which library will you always need to load in your first RMarkdown chunk?
- What is psydata?
- How would you look at/inspect a whole dataset?
- What does glimpse() do and when is it useful?
- What is the 5th column in the development dataset?
- Which function makes a plot? (there are many, but we mean the one shown above)
- Which function chooses the columns of data used in the plot?
Extension exercises
Please remember that these extension exercises are not required to pass the course. We include them because some students work through these materials much more quickly than others — often because they have more previous experience with programming — and we aim to give all students the opportunity to stretch their skills.
If you do find you have extra time these exercises are intended to provide additional practice in the techniques covered and to be useful preparation for using R independently in a stage 4 or MSc research project.
Extension exercise 1
This scatterplot uses the fuel dataset to show a vehicle’s power on the x-axis against mpg on the y-axis.
In a new chunk, write the R code to produce this plot.
Extension exercise 2
There is another built-in dataset called iris which includes data about different flower species.
Use glimpse() to get a list of the column names.
Make a scatterplot which shows the relationships between petal widths and lengths.
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, …
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, …
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, …
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
Reading
Strongly suggested:
- Read the RMarkdown guide for a complete overview of what they can do: https://rmarkdown.rstudio.com/lesson-1.html
- The page on “Markdown basics” is especially helpful if you want to format your document
Extended material:
- Scatterplots and visualisation: Fundamentals of Data Visualization is an excellent resource for data visualisation in R.
- This chapter: https://clauswilke.com/dataviz/visualizing-associations.html shows many examples of plots which display relationships between variables (including scatter plots) which would extend the material covered above.