In brief
Linear regression is a fancy term for drawing the ‘best-fitting’ line through a scatter plot of two variables, and summarising how well the line describes the data.
When doing regression in R, the relationship between an outcome and one or more predictors is described using a formula. When a model is ‘fitted’ to a sample it becomes a tool to make predictions for future samples. It also allows us to quantify our uncertainty about those predictions.
Coefficients are numbers telling us how strong the relationships between predictors and outcomes are. But the meaning of coefficients depends entirely on the design of our study, and the assumptions we are prepared to make. Causal diagrams (as you drew in the first session) can help us decide how to interpret our results.
Before you start
- Make a new Rmd file to save your work in this session. Call it regression.rmd. Make sure you save it in the same directory (folder) as your data handling file.
By eye
We want to fit lines through our data because we can use them to make predictions for new data (i.e. future students).
In the workshop we fitted lines of different shapes (straight, curved) to different samples from the same population.
The lines might have looked something like this:
We found that the straight line was a worse fit than a curved line for the initial sample we used to draw it.
However, when we used the same lines to predict observations in a new sample, the straight line did better than the curved one.
The principle here is that the curved lines we drew over-fitted the data. This is why we want to use simple models (like straight lines) as much as we possibly can.
Only do this after the workshop; use the time in class to complete the exercises below.
Once you have done the class exercise, read this more detailed explanation of what is going on.
First regression model
In this exercise we will use an example dataset which contains 5 variables from a questionnaire about study habits.
The dataset is called studyhabits and is in the psydata package. studyhabits also contains a test score (grade) and some demographic variables on participants:
studyhabits %>% glimpse
Rows: 300
Columns: 9
$ work_consistently <dbl> 4, 3, 2, 4, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 5, 3, 4, 3, 3, 2, 3, 3, 3, 4, 2, 4, 4, 3, 3, 5, 3…
$ revision_before <dbl> 3, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 4, 3, 3, 2, 3, 4, 4, 4, 2, 3, 3, 4, 4, 5, 4, 3, 5, 3, 3, 2…
$ focus_deadline <dbl> 5, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3, 2, 4, 4, 2, 3, 3, 3, 3, 4, 3, 2, 3, 3, 4, 2…
$ progress_everyday <dbl> 4, 2, 2, 5, 2, 3, 3, 2, 3, 3, 2, 3, 4, 2, 2, 2, 3, 2, 3, 2, 3, 4, 4, 2, 3, 4, 2, 4, 2, 3, 2…
$ work_hours <dbl> 26, 26, 16, 33, 25, 20, 32, 31, 22, 33, 22, 28, 29, 26, 18, 26, 31, 28, 23, 18, 28, 27, 23,…
$ msc_student <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FA…
$ female <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE…
$ unique_id <fct> 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1…
$ grade <dbl> 65, 57, 45, 70, 55, 50, 57, 69, 62, 63, 75, 67, 62, 63, 45, 64, 57, 68, 66, 21, 62, 64, 61,…
If you want to refer to it, the questionnaire used to generate the data can be viewed here
- Load psydata and tidyverse
- Check the data by looking at the first 6 rows using head or glimpse (a sketch of this is shown below)
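If you get stuck, a minimal sketch of what these two steps might look like (assuming both packages are already installed):

# load the packages used in this worksheet
library(psydata)
library(tidyverse)

# check the first 6 rows of the data
studyhabits %>% head()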
Plotting the data first
Density plots
Before we do any statistical analyses, we should first plot the data.
For example, a density plot or histogram is good for showing the distribution of test scores.
Using the studyhabits data in psydata:
- Reproduce the density plot or histogram from above.
- Make a second plot that differentiates the scores of men and women, using colour.
See the ‘making a density plot’ video at the end of this worksheet.
We first encountered density plots in this worksheet
You can make a density plot by writing:
DATA %>%
  ggplot(aes(x = COLUMN_NAME)) +
  geom_density()
Or a histogram by writing:
DATA %>%
  ggplot(aes(x = COLUMN_NAME)) +
  geom_histogram()
If you want to be fancy you can write: + geom_density(aes(y=..scaled..))
which will set the scale on the y axis to be from 0 to 1 — that is, the value of the line on the y axis is scaled relative to the highest point.
studyhabits %>%
  ggplot(aes(grade, color=female)) +
  geom_density(aes(y=..scaled..))
- Describe in words the distribution of the grade variable in this dataset. You can use summarize() to calculate extra statistics like the mean, median or sd to add to the text (a sketch is shown below).
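For example, a sketch of one way to calculate these statistics (the column names like mean_grade are just illustrative choices):

# summary statistics for grade, to include in your written description
studyhabits %>%
  summarize(mean_grade = mean(grade),
            median_grade = median(grade),
            sd_grade = sd(grade))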
We can see that MCQ grades range from about 20% up to 95%+. The mean score is just over 60, and the data are (roughly) normally distributed.
This is good, because the variable we use as the outcome in regression should be continuous and, although not strictly required, it’s often helpful for it to be normally distributed.
Scatter plots
We can also use a scatter plot to show the relationship between MCQ grade and responses to one of the questions:
studyhabits %>%
  ggplot(aes(work_consistently, grade)) +
  geom_jitter()
This plot uses geom_jitter() rather than geom_point() to add a small amount of random noise to the plotted points, which reveals the pattern of responses more clearly (many points would otherwise overlap).
Using the studyhabits data in psydata:
Make a scatter plot with progress_everyday on the x axis, and grade on the y axis.

Before we start using R to fit lines for us, agree as a group or with a partner:

- What relationship do we see between progress_everyday and grades?
- Would you say this is a weak, moderate, or strong relationship?
studyhabits %>%
  ggplot(aes(progress_everyday, grade)) +
  geom_jitter()
The relationship is positive, but only moderate in strength.
The correlation is only 0.32
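If you want to check this for yourself, one way (a sketch) is to combine summarize() with the base R cor() function:

# correlation between progress_everyday and grade
studyhabits %>%
  summarize(r = cor(progress_everyday, grade))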
Using lm
To get R to fit a line we use a new function called lm.
The letters l and m stand for linear model.
There are lots of ways to use lm, but the easiest to picture is to use it as part of a ggplot:
Code and explanation shown in the video:
studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_point() +
  geom_smooth(method=lm, se=F)
Explanation of the code: We used geom_point to create a scatterplot. Then we used a plus symbol (+) and added geom_smooth(method=lm) to add the fitted line. By default, geom_smooth would try to fit a curvy line through your datapoints, but adding method=lm makes it a straight line.
Explanation of the resulting plot: The plot above is just like the scatter plots we drew before, but adds the blue fitted line. The blue line shows the ‘line of best fit’: the line that minimises the residuals (the gaps between the line and the points). Ignore the shaded area for now (explanation here if you are keen).
Don’t worry about trying to add the vertical and horizontal lines at zero you see in this plot. These are just shown in my version to emphasise the fact that we didn’t collect any data where work_hours was < 10.
You can use geom_vline() and geom_hline() to add lines at specific points on each axis. This is the exact code that made the plot above.
studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_point() +
  geom_smooth(method=lm) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)
- Using the studyhabits data, plot a line graph with:
  - The same variables (i.e. reproduce the plot with work_hours and grade)
  - Different variables (from the studyhabits dataset)
  - Without the method=lm part (to see a curvy line instead of a straight one)
- Now, add colour to the plot to distinguish between men and women.
Adding colour=female inside the part which says aes(...) should plot different lines for men and women. Something like this:
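One way of writing this (a sketch; you can drop se=FALSE if you want to keep the shaded area):

# colour the points and the fitted lines by the female column
studyhabits %>%
  ggplot(aes(work_hours, grade, colour = female)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)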
Summary
- We can fit straight lines to data
- We use the lm function to do this, within ggplot, using geom_smooth(method=lm)
- We can fit multiple lines for different groups (e.g. men and women) by using the color aesthetic (in aes())
Coefficients
The plot using lm is helpful, because we can see the best-fit line.
However, we also want to have a single number to say how steep the line is. That is, a number to say how closely related the variables are.
To do this we can use the lm function directly:
lm(grade ~ work_hours, data = studyhabits)
Call:
lm(formula = grade ~ work_hours, data = studyhabits)
Coefficients:
(Intercept) work_hours
38.3696 0.8377
Explanation of the code: The first part tells R to use the lm function. The next part is known as a model formula: the ~ symbol (it’s called a ‘tilde’) just means “is predicted by”, so you can read it as “grade is predicted by work hours”. Finally, we tell lm to use the studyhabits data.
Explanation of the output: When we use lm, the output displays:
- The ‘Call’ we made (i.e. what inputs we gave, so we can remember how we did it later on)
- The ‘coefficients’ of the model. These are the numbers which represent the line on the graph that we saw above.
Understanding the coefficients
In this example, we have two coefficients: the (Intercept), which is 38.3696, and the work_hours coefficient, which is 0.8377.
The best way to think about the coefficients is in relation to the plot I showed in the workshop:
We can interpret the coefficients as follows:
- The (Intercept) coefficient is the height (on the y axis) where the blue dotted line crosses zero (on the x axis).
- The work_hours coefficient is how steep the slope of the line is. Specifically, it says how many grade points the line will rise if we increase work_hours by 1.
(Watch the previous video on lm again for extra clarification/interpretation of the coefficients).
Using different predictors
The studyhabits data has other variables we could use to predict grades.
1. Run a number of different regression models, using a different predictor each time.
2. Which of the questions produces the largest slope coefficient?
3. If you could intervene to change participants, which variable would you most like to alter? (Assuming you wanted to increase test scores.)
First 2 questions:
# an example of another model
lm(grade ~ work_consistently, data=studyhabits)
# being an MSc student made the biggest difference!
lm(grade ~ msc_student, data=studyhabits)
Answering question 3 is much harder. In fact there is no simple right answer here.
We might say that we’d want to make all students into MSc students. After all, that was the variable with the largest coefficient, so changing that notionally has the biggest impact on grades.
But that ignores the fact that being an MSc student isn’t likely to cause you to have higher grades. It’s probably an example of confounding: students doing the MSc have been selected to some degree because of their ability to get higher grades on previous modules.
To make a decision between the variables we’d also have to consider the scale of the variables. If one of the predictors ranges between 0 and 7 (for example) whereas another ranges from 0 to 100 then that changes how we interpret the coefficients. A one point change on a 0-7 scale is ‘bigger’ than a one point change on a 0-100 scale. We’ll return to this problem in the final workshop.
Making predictions
By hand
A big advantage of using the coefficients alongside the plot is that we can easily make predictions for future cases.
- Let’s say we meet someone who works 30 hours per week. One way to predict their grade would be by eye, using the line on the plot.
What grade would you expect them to get simply by ‘eye-balling’ the line?
We should expect a grade just over 62%. To see this find 30 on the x axis, and track upwards until you hit the blue line. Then trace left until you see what score is on the y axis. This is the prediction the line makes.
Using coefficients
We can do the same thing using the coefficients from lm, because we know that:
- If someone worked for 0 hours per week then our prediction would be 38.37, because this is the intercept value (the point on the line when it is at zero on the x-axis).
- For each extra hour we study, the work_hours coefficient tells us that our grade will increase by 0.84.
- So, if we study for 30 hours, our prediction is \(38.37 + 0.84 \times 30 = 63.57\) (a sketch of this calculation in R is shown below).
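If you would rather let R do this arithmetic, here is a minimal sketch using the base coef() function to pull the coefficients out of a saved model (we save a model under the name first.model in the next section too):

# save the model, then extract its two coefficients
first.model <- lm(grade ~ work_hours, data = studyhabits)
b <- coef(first.model)

# prediction for someone who works 30 hours per week
b["(Intercept)"] + b["work_hours"] * 30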
Using the coefficients from the model lm(grade ~ work_hours, data = studyhabits):
- Make predictions for people who study 5, 20 or 40 hours per week
- Compare these predictions to the plot above.
- Which of the predictions (for 5, 20 or 40 study hours) should we be most confident in? Why is this?
You should get something like: 42.6, 55.1 and 71.9
Don’t worry about rounding errors… within 1 point is fine.
We should be most confident about the prediction for 20 hours because we have more data in the sample which is close to that value. Our line was estimated from data which didn’t have many people who worked 5 or 40 hours, so we don’t really know about what happens at those extremes.
Using augment
Rather than making predictions by hand, we can save time by using the augment() function.
augment() takes a model, and returns the dataset used to fit it, plus new columns for the model predictions and residuals.
Here’s how it works: if we run a regression, we should first save the fitted model under a name:
first.model <- lm(grade ~ work_hours, data = studyhabits)
If we feed this model to the augment function, we see a copy of the studyhabits data, with added columns for the model predictions (and some other things).
augment() is in the broom package so we have to load that first.
library(broom)
augment(first.model) %>% glimpse
Rows: 300
Columns: 9
$ grade <dbl> 65, 57, 45, 70, 55, 50, 57, 69, 62, 63, 75, 67, 62, 63, 45, 64, 57, 68, 66, 21, 62, 64, 61, 52, 61…
$ work_hours <dbl> 26, 26, 16, 33, 25, 20, 32, 31, 22, 33, 22, 28, 29, 26, 18, 26, 31, 28, 23, 18, 28, 27, 23, 26, 29…
$ .fitted <dbl> 60.15051, 60.15051, 51.77323, 66.01460, 59.31278, 55.12414, 65.17687, 64.33915, 56.79960, 66.01460…
$ .se.fit <dbl> 0.5132544, 0.5132544, 1.0361769, 0.9700008, 0.5003213, 0.7060814, 0.8839487, 0.8017151, 0.5800900,…
$ .resid <dbl> 4.8494933, -3.1505067, -6.7732287, 3.9853988, -4.3127789, -5.1241399, -8.1768734, 4.6608544, 5.200…
$ .hat <dbl> 0.003510245, 0.003510245, 0.014306713, 0.012537653, 0.003335569, 0.006643265, 0.010411808, 0.00856…
$ .sigma <dbl> 8.672905, 8.675552, 8.668451, 8.674364, 8.673864, 8.672355, 8.664367, 8.673233, 8.672213, 8.675699…
$ .cooksd <dbl> 5.538940e-04, 2.337732e-04, 4.500801e-03, 1.360695e-03, 4.161295e-04, 1.177755e-03, 4.736241e-03, …
$ .std.resid <dbl> 0.56078448, -0.36431749, -0.78751869, 0.46296434, -0.49867632, -0.59347768, -0.94884671, 0.5403428…
Explanation of the output: The augment function has made a prediction for each row and added it to the original dataset used to fit the model. Alongside the .fitted value and the .resid (residual) there are some other columns we can ignore for now.
Try using augment to make predictions from one of the regression models you have already run.
Remember, you will need to save the lm result to a new variable before you can use the model with these functions.
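For example, a sketch using progress_everyday as the predictor (any of the other predictors would work the same way):

# save a second model, then add predictions and residuals for every row
second.model <- lm(grade ~ progress_everyday, data = studyhabits)
augment(second.model) %>% glimpse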
For new samples
Often we don’t want a prediction for each row in the original dataset. Rather, we want predictions for specific values of the predictor variable.
To do this, we can use the newdata argument to augment.
First, we create a new single-row dataframe which contains the new predictor values we want a prediction for:
newsamples <- tibble(work_hours=30)
newsamples
# A tibble: 1 x 1
work_hours
<dbl>
1 30
Explanation of the code: The tibble command makes a new dataframe for us (a tibble is a special kind of dataframe). By writing work_hours=30 we have made a single-row dataframe with a single column called work_hours. We saved this under the name newsamples.
We can use this new dataframe with the augment function:
augment(first.model, newdata=newsamples)
# A tibble: 1 x 3
work_hours .fitted .se.fit
<dbl> <dbl> <dbl>
1 30 63.5 0.725
Explanation of the output: We have a new data frame with a prediction for a new person who worked 30 hours.
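The new dataframe doesn't have to contain only one row. Here is a sketch making predictions for several values of work_hours at once (the values just echo the earlier exercise; morepeople is an arbitrary name):

# predictions for people working 5, 20 or 40 hours per week
morepeople <- tibble(work_hours = c(5, 20, 40))
augment(first.model, newdata = morepeople)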
Create a new single-row dataframe to make a prediction (using augment) for someone who worked 33 hours per week.
Consolidation
This video walks through the steps covered today:
- Run a regression model to predict grade using different columns in the studyhabits data.
- Once you have run a model, describe in words what each of the coefficients means (write this down).
- Talk your explanation through with a TARA or member of staff to check your understanding.
Joining data
In the activities above we saw that to run a regression we need at least 2 columns:
- the outcome variable and
- the predictor variable(s).
Previously, you recoded and created a summary variable for one of the 3 chosen questionnaires. This provides the predictor.
To predict the MCQ score (our outcome measure) we would need to combine two sources of data.
Combining two datasets
Imagine we have two datasets like this, one containing identifying information:
# A tibble: 3 x 2
Identifier Name
<chr> <chr>
1 123ABC Ben
2 456DEF Helen
3 678FGH Esther
And a second dataset containing sensitive questionnaire data:
# A tibble: 3 x 2
Identifier `Worst ever record purchase`
<chr> <chr>
1 123ABC Europe, the final countdown
2 456DEF Spice Girls Wannabe
3 678FGH One Direction - Night changes
Note that the Identifier column is common to both datasets.
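If you want to recreate this toy example yourself, here is a sketch of how the two dataframes could be built with tibble() (the names personal_data and research_data match the left_join() call below):

# identifying information
personal_data <- tibble(
  Identifier = c("123ABC", "456DEF", "678FGH"),
  Name = c("Ben", "Helen", "Esther"))

# 'sensitive' questionnaire data
research_data <- tibble(
  Identifier = c("123ABC", "456DEF", "678FGH"),
  `Worst ever record purchase` = c("Europe, the final countdown",
                                   "Spice Girls Wannabe",
                                   "One Direction - Night changes"))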
We can use a new tidyverse function which joins these two datasets into a combined table:
left_join(personal_data, research_data, by="Identifier")
# A tibble: 3 x 3
Identifier Name `Worst ever record purchase`
<chr> <chr> <chr>
1 123ABC Ben Europe, the final countdown
2 456DEF Helen Spice Girls Wannabe
3 678FGH Esther One Direction - Night changes
Explanation of left_join: The left_join function takes two dataframes as input. We call the first input the ‘left hand side’, and the second input the ‘right hand side’.
Both left and right sides have to share at least one column. In this case it’s called Identifier.
When values in Identifier match in the left and right hand sides, left_join copies the extra columns from the right hand side into the left hand side.
It returns the combined dataframe.
Any extra rows in the right hand side which don’t match the left hand side are dropped. This might happen if we have research data for people we don’t have personal data for. There are other functions like full_join which do this a bit differently, but we don’t need them for now.
- Have a look at the heroes_meta and heroes_personal datasets in psydata.
- Adapt the code for left_join() shown above to combine these datasets.
- Pipe the result of left_join() to count() to check how many rows are in the final dataset.
- Does it matter which order you join the datasets in? Can you explain the number of rows in the resulting dataset?
- Save the combined dataset into a new csv file using write_csv(). Check you can find this in the RStudio files pane, and download a copy to be sure.
Do the join:
left_join(heroes_meta, heroes_personal)
name publisher gender eye_color race height weight
1 Giganta DC Comics Female green - 62.5 630
2 Black Goliath Marvel Comics <NA> <NA> <NA> NA NA
3 Spider-Man Marvel Comics - red Human 178.0 77
4 Spider-Man Marvel Comics Male brown Human 157.0 56
5 Spider-Man Marvel Comics Male hazel Human 178.0 74
6 Superboy-Prime DC Comics Male blue Kryptonian 180.0 77
7 Lady Bullseye Marvel Comics Female - - NA NA
8 Black Canary DC Comics Female blue Human 165.0 58
9 Black Canary DC Comics Female blue Metahuman 170.0 59
10 Black Lightning DC Comics Male brown - 185.0 90
11 X-23 Marvel Comics Female green Mutant / Clone 155.0 50
12 Silverclaw Marvel Comics Female brown - 157.0 50
13 Hiro Nakamura NBC - Heroes Male - - NA NA
14 Beta Ray Bill Marvel Comics <NA> <NA> <NA> NA NA
15 Franklin Storm Marvel Comics - blue - 188.0 92
16 Katniss Everdeen Female - Human NA NA
17 Colossal Boy DC Comics Male - - NA NA
18 Sage Marvel Comics Female blue - 170.0 61
19 Abomination Marvel Comics Male green Human / Radiation 203.0 441
20 Thing Marvel Comics <NA> <NA> <NA> NA NA
21 Red Robin DC Comics Male blue Human 165.0 56
22 Deadman DC Comics <NA> <NA> <NA> NA NA
23 Cyborg Superman DC Comics Male blue Cyborg NA NA
24 Paul Blart Sony Pictures <NA> <NA> <NA> NA NA
25 Multiple Man Marvel Comics Male blue - 180.0 70
26 Siren DC Comics Female blue Atlantean 175.0 72
27 Stardust Marvel Comics Male - - NA NA
28 Savage Dragon Image Comics Male - - NA NA
29 Parademon DC Comics - - Parademon NA NA
30 Valkyrie Marvel Comics Female blue - 191.0 214
31 Magneto Marvel Comics Male grey Mutant 188.0 86
32 Kang Marvel Comics Male brown - 191.0 104
33 Black Widow II Marvel Comics Female blue - 170.0 61
34 Boomer Marvel Comics Female - - NA NA
35 Ink Marvel Comics Male blue Mutant 180.0 81
36 Arclight Marvel Comics Female violet - 173.0 57
37 Spider-Woman II Marvel Comics Female - - NA NA
38 Green Arrow DC Comics Male green Human 188.0 88
39 Firelord Marvel Comics - white - 193.0 99
40 Spider-Woman IV Marvel Comics Female red - 178.0 58
41 Wondra Marvel Comics Female - - NA NA
42 Cogliostro Image Comics Male - - NA NA
43 Guy Gardner DC Comics Male blue Human-Vuldarian 188.0 95
44 Jennifer Kale Marvel Comics <NA> <NA> <NA> NA NA
45 Han Solo George Lucas Male brown Human 183.0 79
46 Vibe DC Comics Male brown Human 178.0 71
47 Doc Samson Marvel Comics Male blue Human / Radiation 198.0 171
48 Riddler DC Comics Male - - NA NA
49 Hawkeye Marvel Comics Male blue Human 191.0 104
50 Brainiac DC Comics Male green Android 198.0 135
51 Cable Marvel Comics Male blue Mutant 203.0 158
52 Blue Beetle DC Comics Male - - NA NA
53 Drax the Destroyer Marvel Comics Male red Human / Altered 193.0 306
54 Skaar Marvel Comics Male green - 198.0 180
55 Nite Owl II DC Comics Male - - NA NA
56 Purple Man Marvel Comics Male purple Human 180.0 74
57 Greedo George Lucas Male purple Rodian 170.0 NA
58 Darth Vader George Lucas Male yellow Cyborg 198.0 135
59 Vulture Marvel Comics Male brown Human 180.0 79
60 Rocket Raccoon Marvel Comics Male brown Animal 122.0 25
61 Plastic Man DC Comics Male blue Human 185.0 80
62 Evilhawk Marvel Comics Male red Alien 191.0 106
63 JJ Powell ABC Studios Male - - NA NA
64 Thunderbird II Marvel Comics Male - - NA NA
65 Sunspot Marvel Comics Male brown Mutant 173.0 77
66 Shriek Marvel Comics Female yellow / blue - 173.0 52
67 Marvel Girl Marvel Comics Female green - 170.0 56
68 Destroyer Marvel Comics Male - - 188.0 383
69 Overtkill Image Comics Male - - NA NA
70 Garbage Man DC Comics Male - Mutant NA NA
71 Chamber Marvel Comics Male brown Mutant 175.0 63
72 Batman II DC Comics Male blue Human 178.0 79
73 Sabretooth Marvel Comics Male amber Mutant 198.0 171
74 Wyatt Wingfoot Marvel Comics Male brown - 196.0 117
75 Namorita Marvel Comics Female blue - 168.0 101
76 Mister Mxyzptlk DC Comics Male - God / Eternal NA NA
77 Bane DC Comics Male - Human 203.0 180
78 Metron DC Comics <NA> <NA> <NA> NA NA
79 Red Arrow DC Comics Male green Human 180.0 83
80 Allan Quatermain Wildstorm Male - - NA NA
81 Swarm Marvel Comics Male yellow Mutant 196.0 47
82 Steel DC Comics Male brown - 201.0 131
83 Hawkwoman III DC Comics Female blue - 170.0 65
84 Blizzard Marvel Comics Male - - NA NA
85 Blizzard Marvel Comics Male - - NA NA
86 Warlock Marvel Comics Male red - 188.0 108
87 Black Knight III Marvel Comics <NA> <NA> <NA> NA NA
88 Deadshot DC Comics Male brown Human 185.0 91
89 King Kong Male yellow Animal 30.5 NA
90 Sif Marvel Comics Female blue Asgardian 188.0 191
91 Tempest Marvel Comics Female brown - 163.0 54
92 Captain Planet Marvel Comics Male red God / Eternal NA NA
93 Beetle Marvel Comics Male - - NA NA
94 Professor X Marvel Comics Male blue Mutant 183.0 86
95 Blue Beetle III DC Comics Male brown Human NA NA
96 Fixer Marvel Comics - red - NA NA
97 Medusa Marvel Comics Female green Inhuman 180.0 59
98 Clock King DC Comics Male blue Human 178.0 78
99 Cat Marvel Comics Female blue - 173.0 61
100 Morph Marvel Comics Male white - 178.0 79
101 Donatello IDW Publishing Male green Mutant NA NA
102 T-1000 Dark Horse Comics Male - Android 183.0 146
103 Flash II DC Comics <NA> <NA> <NA> NA NA
104 Man of Miracles Image Comics - blue God / Eternal NA NA
105 Thanos Marvel Comics Male red Eternal 201.0 443
106 Exodus Marvel Comics Male blue Mutant 183.0 88
107 Ultragirl Marvel Comics <NA> <NA> <NA> NA NA
108 Archangel Marvel Comics Male blue Mutant 183.0 68
109 Ra's Al Ghul DC Comics Male green Human 193.0 97
110 Nova Marvel Comics Male brown Human 185.0 86
111 Yellowjacket II Marvel Comics Female blue Human 165.0 52
112 Micah Sanders NBC - Heroes Male brown - NA NA
113 Silk Spectre II DC Comics Female - - NA NA
114 Jigsaw Marvel Comics Male blue - 188.0 113
115 Ms Marvel II Marvel Comics Female blue - 173.0 61
116 Leech Marvel Comics Male - - NA NA
117 Junkpile Marvel Comics Male - Mutant NA NA
118 Husk Marvel Comics <NA> <NA> <NA> NA NA
119 Silk Marvel Comics Female brown Human NA NA
120 Bumblebee DC Comics Female brown Human 170.0 59
121 Mephisto Marvel Comics Male white - 198.0 140
122 Yoda George Lucas Male brown Yoda's species 66.0 17
123 Thunderbird III Marvel Comics <NA> <NA> <NA> NA NA
124 Mister Fantastic Marvel Comics Male brown Human / Radiation 185.0 81
125 Cottonmouth Marvel Comics Male brown Human 183.0 99
126 Mantis Marvel Comics Female green Human-Kree 168.0 52
127 Nebula Marvel Comics Female blue Luphomoid 185.0 83
128 Question DC Comics Male blue Human 188.0 83
129 Stacy X Marvel Comics Female - - NA NA
130 Deathlok Marvel Comics Male brown Cyborg 193.0 178
131 Quill Marvel Comics Male brown - 163.0 56
132 Songbird Marvel Comics Female green - 165.0 65
133 Gravity Marvel Comics Male blue Human 178.0 79
134 Hobgoblin Marvel Comics Male blue - 180.0 83
135 Cameron Hicks SyFy Male - Alpha NA NA
136 Bill Harken SyFy Male - Alpha NA NA
137 Big Man Marvel Comics Male blue - 165.0 71
138 Fabian Cortez Marvel Comics - blue - 196.0 96
139 Ymir Marvel Comics <NA> <NA> <NA> NA NA
140 Ghost Rider II Marvel Comics - - - NA NA
141 Mandarin Marvel Comics Male blue Human 188.0 97
142 Ant-Man Marvel Comics Male blue Human 211.0 122
[ reached 'max' / getOption("max.print") -- omitted 547 rows ]
Count the rows:
left_join(heroes_meta, heroes_personal) %>% count()
n
1 689
left_join(heroes_personal, heroes_meta) %>% count()
n
1 691
We get slightly different numbers depending on which order we join the data in.
This is because left_join keeps every row from the left hand side, but only keeps rows from the right hand dataset IF they have a match in the left hand side.
So if the two datasets don't contain exactly the same cases, the number of rows in the result depends on which dataset you put first.
Other functions like inner_join and full_join take different approaches. Type ?dplyr::inner_join for an explanation.
This would export a csv file to the same location as your rmd file:
left_join(heroes_meta, heroes_personal) %>% rio::export('combined.csv')
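The exercise above asked for write_csv(); an equivalent using that function (from readr, which is loaded as part of tidyverse) would be:

# save the combined dataset as combined.csv next to your rmd file
left_join(heroes_meta, heroes_personal) %>% write_csv('combined.csv')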