In brief
Linear regression is a fancy term for drawing the ‘best-fitting’ line through a scatter plot of two variables, and summarising how well the line describes the data.
When doing regression in R, the relationship between an outcome and one or more predictors is described using a formula. When a model is ‘fitted’ to a sample it becomes a tool to make predictions for future samples. It also allows us to quantify our uncertainty about those predictions.
Coefficients are numbers telling us how strong the relationships between predictors and outcomes are. But the meaning of coefficients depends entirely on the design of our study, and the assumptions we are prepared to make. Causal diagrams (as you drew in the first session) can help us decide how to interpret our results.
Before you start
- Make a new Rmd file to save your work in this session. Call it regression.rmd. Make sure you save it in the same directory (folder) as your data handling file.
By eye
We want to fit lines through our data because we can use them to make predictions for new data (i.e. future students).
In the workshop we fitted lines of different shapes (straight, curved) to different samples from the same population.
The lines might have looked something like this:
We found that the straight line was a worse fit than a curved line for the initial sample we used to draw it.
However, when we used the same lines to predict observations in a new sample, the straight line did better than the curved one.
The principle here is that the curved lines we drew over-fitted the data. This is why we want to use simple models (like straight lines) as much as we possibly can.
Only do this after the workshop; use the time in class to complete the exercises below.
Once you have done the class exercise, read this more detailed explanation of what is going on.
First regression model
In this exercise we will use an example dataset which contains 5 variables from a questionnaire about study habits.
The dataset is called studyhabits and is in the psydata package. studyhabits also contains a test score (grade) and some demographic variables on participants:
studyhabits %>% glimpse
Rows: 300
Columns: 9
$ work_consistently <dbl> 4, 3, 2, 4, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 5, 3, 4, 3, 3, 2, 3, 3, 3, 4, 2, 4, 4, 3, 3, 5, 3…
$ revision_before <dbl> 3, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 4, 3, 3, 2, 3, 4, 4, 4, 2, 3, 3, 4, 4, 5, 4, 3, 5, 3, 3, 2…
$ focus_deadline <dbl> 5, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3, 2, 4, 4, 2, 3, 3, 3, 3, 4, 3, 2, 3, 3, 4, 2…
$ progress_everyday <dbl> 4, 2, 2, 5, 2, 3, 3, 2, 3, 3, 2, 3, 4, 2, 2, 2, 3, 2, 3, 2, 3, 4, 4, 2, 3, 4, 2, 4, 2, 3, 2…
$ work_hours <dbl> 26, 26, 16, 33, 25, 20, 32, 31, 22, 33, 22, 28, 29, 26, 18, 26, 31, 28, 23, 18, 28, 27, 23,…
$ msc_student <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FA…
$ female <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE…
$ unique_id <fct> 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1…
$ grade <dbl> 65, 57, 45, 70, 55, 50, 57, 69, 62, 63, 75, 67, 62, 63, 45, 64, 57, 68, 66, 21, 62, 64, 61,…
If you want to refer to it, the questionnaire used to generate the data can be viewed here
- Load psydata and tidyverse
- Check the data by looking at the first 6 rows using head or glimpse (a sketch of this is shown below)
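If you get stuck, a minimal sketch of what these two steps might look like (assuming both packages are already installed):

# load the packages used in this worksheet
library(psydata)
library(tidyverse)

# check the first 6 rows of the data
studyhabits %>% head()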
Plotting the data first
Density plots
Before we do any statistical analyses, we should first plot the data.
For example, a density plot or histogram is good for showing the distribution of test scores.
Using the studyhabits data in psydata:
- Reproduce the density plot or histogram from above.
- Make a second plot that differentiates the scores of men and women, using colour.
See the ‘making a density plot’ video at the end of this worksheet.
We first encountered density plots in this worksheet
You can make a density plot by writing:
DATA %>%
  ggplot(aes(x = COLUMN_NAME)) +
  geom_density()
Or a histogram by writing:
DATA %>%
  ggplot(aes(x = COLUMN_NAME)) +
  geom_histogram()
If you want to be fancy you can write: + geom_density(aes(y=..scaled..))
which will set the scale on the y axis to be from 0 to 1 — that is, the value of the line on the y axis is scaled relative to the highest point.
studyhabits %>%
  ggplot(aes(grade, color=female)) +
  geom_density(aes(y=..scaled..))
- Describe in words the distribution of the grade variable in this dataset. You can use summarize() to calculate extra statistics like the mean, median or sd to add to the text (a sketch is shown below).
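For example, a sketch of one way to calculate these statistics (the column names like mean_grade are just illustrative choices):

# summary statistics for grade, to include in your written description
studyhabits %>%
  summarize(mean_grade = mean(grade),
            median_grade = median(grade),
            sd_grade = sd(grade))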
We can see that MCQ grades range from about 20% up to 95%+. The mean score is just over 60, and the data are (roughly) normally distributed.
This is good, because the variable we use as the outcome in regression should be continuous and, although not strictly required, it’s often helpful for it to be normally distributed.
Scatter plots
We can also use a scatter plot to show the relationship between MCQ grade and responses to one of the questions:
studyhabits %>%
  ggplot(aes(work_consistently, grade)) +
  geom_jitter()
This plot uses geom_jitter() rather than geom_point() to add a small amount of random noise to the plotted points, which reveals the pattern of responses more clearly (many points would otherwise overlap).
Using the studyhabits data in psydata:
Make a scatter plot with progress_everyday on the x axis, and grade on the y axis.

Before we start using R to fit lines for us, agree as a group or with a partner:

- What relationship do we see between progress_everyday and grades?
- Would you say this is a weak, moderate, or strong relationship?
studyhabits %>%
  ggplot(aes(progress_everyday, grade)) +
  geom_jitter()
The relationship is positive, but only moderate in strength.
The correlation is only 0.32
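If you want to check this for yourself, one way (a sketch) is to combine summarize() with the base R cor() function:

# correlation between progress_everyday and grade
studyhabits %>%
  summarize(r = cor(progress_everyday, grade))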
Using lm
To get R to fit a line we use a new function called lm.
The letters l and m stand for linear model.
There are lots of ways to use lm, but the easiest to picture is to use it as part of a ggplot:
Code and explanation shown in the video:
studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_point() +
  geom_smooth(method=lm, se=F)
Explanation of the code: We used geom_point to create a scatterplot. Then we used a plus symbol (+) and added geom_smooth(method=lm) to add the fitted line. By default, geom_smooth would try to fit a curvy line through your datapoints, but adding method=lm makes it a straight line.
Explanation of the resulting plot: The plot above is just like the scatter plots we drew before, but adds the blue fitted line. The blue line shows the ‘line of best fit’: the line that minimises the residuals (the gaps between the line and the points). Ignore the shaded area for now (explanation here if you are keen).
Don’t worry about trying to add the vertical and horizontal lines at zero you see in this plot. These are just shown in my version to emphasise the fact that we didn’t collect any data where work_hours was < 10.
You can use geom_vline() and geom_hline() to add lines at specific points on each axis. This is the exact code that made the plot above.
studyhabits %>%
  ggplot(aes(work_hours, grade)) +
  geom_point() +
  geom_smooth(method=lm) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)
- Using the studyhabits data, plot a line graph with:
  - The same variables (i.e. reproduce the plot with work_hours and grade)
  - Different variables (from the studyhabits dataset)
  - Without the method=lm part (to see a curvy line instead of a straight one)
- Now, add colour to the plot to distinguish between men and women.
Adding colour=female inside the part which says aes(...) should plot different lines for men and women. Something like this:
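One way of writing this (a sketch; you can drop se=FALSE if you want to keep the shaded area):

# colour the points and the fitted lines by the female column
studyhabits %>%
  ggplot(aes(work_hours, grade, colour = female)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)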
Summary
- We can fit straight lines to data
- We use the lm function to do this, within ggplot, using geom_smooth(method=lm)
- We can fit multiple lines for different groups (e.g. men and women) by using the color aesthetic (in aes())
Coefficients
The plot using lm is helpful, because we can see the best-fit line.
However, we also want to have a single number to say how steep the line is. That is, a number to say how closely related the variables are.
To do this we can use the lm function directly:
lm(grade ~ work_hours, data = studyhabits)
Call:
lm(formula = grade ~ work_hours, data = studyhabits)
Coefficients:
(Intercept) work_hours
38.3696 0.8377
Explanation of the code: The first part tells R to use the lm function. The next part is known as a model formula: the ~ symbol (it’s called a ‘tilde’) just means “is predicted by”, so you can read it as “grade is predicted by work hours”. Finally, we tell lm to use the studyhabits data.
Explanation of the output: When we use lm, the output displays:
- The ‘Call’ we made (i.e. what inputs we gave, so we can remember how we did it later on)
- The ‘coefficients’ of the model. These are the numbers which represent the line on the graph that we saw above.
Understanding the coefficients
In this example, we have two coefficients: the (Intercept), which is 38.3696, and the work_hours coefficient, which is 0.8377.
The best way to think about the coefficients is in relation to the plot I showed in the workshop:
We can interpret the coefficients as follows:
- The (Intercept) coefficient is the height (on the y axis) where the blue dotted line crosses zero (on the x axis).
- The work_hours coefficient is how steep the slope of the line is. Specifically, it says how many grade points the line will rise if we increase work_hours by 1.
(Watch the previous video on lm again for extra clarification/interpretation of the coefficients).
Using different predictors
The studyhabits data has other variables we could use to predict grades.
1. Run a number of different regression models, using a different predictor each time.
2. Which of the questions produces the largest slope coefficient?
3. If you could intervene to change participants, which variable would you most like to alter? (Assuming you wanted to increase test scores.)
First 2 questions:
# an example of another model
lm(grade ~ work_consistently, data=studyhabits)
# being an MSc student made the biggest difference!
lm(grade ~ msc_student, data=studyhabits)
Answering question 3 is much harder. In fact there is no simple right answer here.
We might say that we’d want to make all students into MSc students. After all, that was the variable with the largest coefficient, so changing that notionally has the biggest impact on grades.
But that ignores the fact that being an MSc student isn’t likely to cause you to have higher grades. It’s probably an example of confounding: students doing the MSc have been selected to some degree because of their ability to get higher grades on previous modules.
To make a decision between the variables we’d also have to consider the scale of the variables. If one of the predictors ranges between 0 and 7 (for example) whereas another ranges from 0 to 100 then that changes how we interpret the coefficients. A one point change on a 0-7 scale is ‘bigger’ than a one point change on a 0-100 scale. We’ll return to this problem in the final workshop.
Making predictions
By hand
A big advantage of using the coefficients alongside the plot is that we can easily make predictions for future cases.
- Let’s say we meet someone who works 30 hours per week. One way to predict their grade would be by eye, using the line on the plot.
What grade would you expect them to get simply by ‘eye-balling’ the line?
We should expect a grade just over 62%. To see this find 30 on the x axis, and track upwards until you hit the blue line. Then trace left until you see what score is on the y axis. This is the prediction the line makes.
Using coefficients
We can do the same thing using the coefficients from lm, because we know that:
- If someone worked for 0 hours per week then our prediction would be 38.37, because this is the intercept value (the point on the line when it is at zero on the x-axis).
- For each extra hour we study, the work_hours coefficient tells us that our grade will increase by 0.84.
- So, if we study for 30 hours, our prediction is \(38.37 + 0.84 \times 30 = 63.57\) (a sketch of this calculation in R is shown below).
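If you would rather let R do this arithmetic, here is a minimal sketch using the base coef() function to pull the coefficients out of a saved model (we save a model under the name first.model in the next section too):

# save the model, then extract its two coefficients
first.model <- lm(grade ~ work_hours, data = studyhabits)
b <- coef(first.model)

# prediction for someone who works 30 hours per week
b["(Intercept)"] + b["work_hours"] * 30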
Using the coefficients from the model lm(grade ~ work_hours, data = studyhabits):
- Make predictions for people who study 5, 20 or 40 hours per week
- Compare these predictions to the plot above.
- Which of the predictions (for 5, 20 or 40 study hours) should we be most confident in? Why is this?
You should get something like: 42.6, 55.1 and 71.9
Don’t worry about rounding errors… within 1 point is fine.
We should be most confident about the prediction for 20 hours because we have more data in the sample which is close to that value. Our line was estimated from data which didn’t have many people who worked 5 or 40 hours, so we don’t really know about what happens at those extremes.
Using augment
Rather than making predictions by hand, we can save time by using the augment() function.
augment() takes a model, and returns the dataset used to fit it, plus new columns for the model predictions and residuals.
Here’s how it works: if we run a regression, we should first save the fitted model under a name:
first.model <- lm(grade ~ work_hours, data = studyhabits)
If we feed this model to the augment function, we see a copy of the studyhabits data, with added columns for the model predictions (and some other things).
augment() is in the broom package so we have to load that first.
library(broom)
augment(first.model) %>% glimpse
Rows: 300
Columns: 9
$ grade <dbl> 65, 57, 45, 70, 55, 50, 57, 69, 62, 63, 75, 67, 62, 63, 45, 64, 57, 68, 66, 21, 62, 64, 61, 52, 61…
$ work_hours <dbl> 26, 26, 16, 33, 25, 20, 32, 31, 22, 33, 22, 28, 29, 26, 18, 26, 31, 28, 23, 18, 28, 27, 23, 26, 29…
$ .fitted <dbl> 60.15051, 60.15051, 51.77323, 66.01460, 59.31278, 55.12414, 65.17687, 64.33915, 56.79960, 66.01460…
$ .se.fit <dbl> 0.5132544, 0.5132544, 1.0361769, 0.9700008, 0.5003213, 0.7060814, 0.8839487, 0.8017151, 0.5800900,…
$ .resid <dbl> 4.8494933, -3.1505067, -6.7732287, 3.9853988, -4.3127789, -5.1241399, -8.1768734, 4.6608544, 5.200…
$ .hat <dbl> 0.003510245, 0.003510245, 0.014306713, 0.012537653, 0.003335569, 0.006643265, 0.010411808, 0.00856…
$ .sigma <dbl> 8.672905, 8.675552, 8.668451, 8.674364, 8.673864, 8.672355, 8.664367, 8.673233, 8.672213, 8.675699…
$ .cooksd <dbl> 5.538940e-04, 2.337732e-04, 4.500801e-03, 1.360695e-03, 4.161295e-04, 1.177755e-03, 4.736241e-03, …
$ .std.resid <dbl> 0.56078448, -0.36431749, -0.78751869, 0.46296434, -0.49867632, -0.59347768, -0.94884671, 0.5403428…
Explanation of the output: The augment function has made a prediction for each row and added it to the original dataset used to fit the model. Alongside the .fitted value and the .resid (residual) there are some other columns we can ignore for now.
Try using augment to make predictions from one of the regression models you have already run.
Remember, you will need to save the lm result to a new variable before you can use the model with these functions.
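For example, a sketch using progress_everyday as the predictor (any of the other predictors would work the same way):

# save a second model, then add predictions and residuals for every row
second.model <- lm(grade ~ progress_everyday, data = studyhabits)
augment(second.model) %>% glimpse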
For new samples
Often we don’t want a prediction for each row in the original dataset. Rather, we want predictions for specific values of the predictor variable.
To do this, we can use the newdata argument to augment.
First, we create a new single-row dataframe which contains the new predictor values we want a prediction for:
newsamples <- tibble(work_hours=30)
newsamples
# A tibble: 1 x 1
work_hours
<dbl>
1 30
Explanation of the code: The tibble command makes a new dataframe for us (a tibble is a special kind of dataframe). By writing work_hours=30 we have made a single-row dataframe with a single column called work_hours. We saved this under the name newsamples.
We can use this new dataframe with the augment function:
augment(first.model, newdata=newsamples)
# A tibble: 1 x 3
work_hours .fitted .se.fit
<dbl> <dbl> <dbl>
1 30 63.5 0.725
Explanation of the output: We have a new data frame with a prediction for a new person who worked 30 hours.
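The new dataframe doesn't have to contain only one row. Here is a sketch making predictions for several values of work_hours at once (the values just echo the earlier exercise; morepeople is an arbitrary name):

# predictions for people working 5, 20 or 40 hours per week
morepeople <- tibble(work_hours = c(5, 20, 40))
augment(first.model, newdata = morepeople)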
Create a new single-row dataframe to make a prediction (using augment) for someone who worked 33 hours per week.
Consolidation
This video walks through the steps covered today:
- Run a regression model to predict grade using different columns in the studyhabits data.
- Once you have run a model, describe in words what each of the coefficients means (write this down).
- Talk your explanation through with a TARA or member of staff to check your understanding.
Joining data
In the activities above we saw that to run a regression we need at least 2 columns:
- the outcome variable and
- the predictor variable(s).
Previously, you recoded and created a summary variable for one of the 3 chosen questionnaires. This provides the predictor.
To predict the MCQ score (our outcome measure) we would need to combine two sources of data.
Combining two datasets
Imagine we have two datasets like this, one containing identifying information:
# A tibble: 3 x 2
Identifier Name
<chr> <chr>
1 123ABC Ben
2 456DEF Helen
3 678FGH Esther
And a second dataset containing sensitive questionnaire data:
# A tibble: 3 x 2
Identifier `Worst ever record purchase`
<chr> <chr>
1 123ABC Europe, the final countdown
2 456DEF Spice Girls Wannabe
3 678FGH One Direction - Night changes
Note that the Identifier column is common to both datasets.
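If you want to recreate this toy example yourself, here is a sketch of how the two dataframes could be built with tibble() (the names personal_data and research_data match the left_join() call below):

# identifying information
personal_data <- tibble(
  Identifier = c("123ABC", "456DEF", "678FGH"),
  Name = c("Ben", "Helen", "Esther"))

# 'sensitive' questionnaire data
research_data <- tibble(
  Identifier = c("123ABC", "456DEF", "678FGH"),
  `Worst ever record purchase` = c("Europe, the final countdown",
                                   "Spice Girls Wannabe",
                                   "One Direction - Night changes"))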
We can use a new tidyverse function which joins these two datasets into a combined table:
left_join(personal_data, research_data, by="Identifier")
# A tibble: 3 x 3
Identifier Name `Worst ever record purchase`
<chr> <chr> <chr>
1 123ABC Ben Europe, the final countdown
2 456DEF Helen Spice Girls Wannabe
3 678FGH Esther One Direction - Night changes
Explanation of left_join: The left_join function takes two dataframes as input. We call the first input the ‘left hand side’, and the second input the ‘right hand side’.
Both left and right sides have to share at least one column. In this case it’s called Identifier.
When values in Identifier match in the left and right hand sides, left_join copies the extra columns from the right hand side into the left hand side.
It returns the combined dataframe.
Any extra rows in the right hand side which don’t match the left hand side are dropped. This might happen if we have research data for people we don’t have personal data for. There are other functions like full_join which do this a bit differently, but we don’t need them for now.
- Have a look at the heroes_meta and heroes_personal datasets in psydata.
- Adapt the code for left_join() shown above to combine these datasets.
- Pipe the result of left_join() to count() to check how many rows are in the final dataset.
- Does it matter which order you join the datasets in? Can you explain the number of rows in the resulting dataset?
- Save the combined dataset into a new csv file using write_csv(). Check you can find this in the RStudio files pane, and download a copy to be sure.
Do the join:
left_join(heroes_meta, heroes_personal)
name publisher gender eye_color race height weight
1 Giganta DC Comics Female green - 62.5 630
2 Black Goliath Marvel Comics <NA> <NA> <NA> NA NA
3 Spider-Man Marvel Comics - red Human 178.0 77
4 Spider-Man Marvel Comics Male brown Human 157.0 56
5 Spider-Man Marvel Comics Male hazel Human 178.0 74
6 Superboy-Prime DC Comics Male blue Kryptonian 180.0 77
7 Lady Bullseye Marvel Comics Female - - NA NA
8 Black Canary DC Comics Female blue Human 165.0 58
9 Black Canary DC Comics Female blue Metahuman 170.0 59
10 Black Lightning DC Comics Male brown - 185.0 90
11 X-23 Marvel Comics Female green Mutant / Clone 155.0 50
12 Silverclaw Marvel Comics Female brown - 157.0 50
13 Hiro Nakamura NBC - Heroes Male - - NA NA
14 Beta Ray Bill Marvel Comics <NA> <NA> <NA> NA NA
15 Franklin Storm Marvel Comics - blue - 188.0 92
16 Katniss Everdeen Female - Human NA NA
17 Colossal Boy DC Comics Male - - NA NA
18 Sage Marvel Comics Female blue - 170.0 61
19 Abomination Marvel Comics Male green Human / Radiation 203.0 441
20 Thing Marvel Comics <NA> <NA> <NA> NA NA
21 Red Robin DC Comics Male blue Human 165.0 56
22 Deadman DC Comics <NA> <NA> <NA> NA NA
23 Cyborg Superman DC Comics Male blue Cyborg NA NA
24 Paul Blart Sony Pictures <NA> <NA> <NA> NA NA
25 Multiple Man Marvel Comics Male blue - 180.0 70
26 Siren DC Comics Female blue Atlantean 175.0 72
27 Stardust Marvel Comics Male - - NA NA
28 Savage Dragon Image Comics Male - - NA NA
29 Parademon DC Comics - - Parademon NA NA
30 Valkyrie Marvel Comics Female blue - 191.0 214
31 Magneto Marvel Comics Male grey Mutant 188.0 86
32 Kang Marvel Comics Male brown - 191.0 104
33 Black Widow II Marvel Comics Female blue - 170.0 61
34 Boomer Marvel Comics Female - - NA NA
35 Ink Marvel Comics Male blue Mutant 180.0 81
36 Arclight Marvel Comics Female violet - 173.0 57
37 Spider-Woman II Marvel Comics Female - - NA NA
38 Green Arrow DC Comics Male green Human 188.0 88
39 Firelord Marvel Comics - white - 193.0 99
40 Spider-Woman IV Marvel Comics Female red - 178.0 58
41 Wondra Marvel Comics Female - - NA NA
42 Cogliostro Image Comics Male - - NA NA
43 Guy Gardner DC Comics Male blue Human-Vuldarian 188.0 95
44 Jennifer Kale Marvel Comics <NA> <NA> <NA> NA NA
45 Han Solo George Lucas Male brown Human 183.0 79
46 Vibe DC Comics Male brown Human 178.0 71
47 Doc Samson Marvel Comics Male blue Human / Radiation 198.0 171
48 Riddler DC Comics Male - - NA NA
49 Hawkeye Marvel Comics Male blue Human 191.0 104
50 Brainiac DC Comics Male green Android 198.0 135
51 Cable Marvel Comics Male blue Mutant 203.0 158
52 Blue Beetle DC Comics Male - - NA NA
53 Drax the Destroyer Marvel Comics Male red Human / Altered 193.0 306
54 Skaar Marvel Comics Male green - 198.0 180
55 Nite Owl II DC Comics Male - - NA NA
56 Purple Man Marvel Comics Male purple Human 180.0 74
57 Greedo George Lucas Male purple Rodian 170.0 NA
58 Darth Vader George Lucas Male yellow Cyborg 198.0 135
59 Vulture Marvel Comics Male brown Human 180.0 79
60 Rocket Raccoon Marvel Comics Male brown Animal 122.0 25
61 Plastic Man DC Comics Male blue Human 185.0 80
62 Evilhawk Marvel Comics Male red Alien 191.0 106
63 JJ Powell ABC Studios Male - - NA NA
64 Thunderbird II Marvel Comics Male - - NA NA
65 Sunspot Marvel Comics Male brown Mutant 173.0 77
66 Shriek Marvel Comics Female yellow / blue - 173.0 52
67 Marvel Girl Marvel Comics Female green - 170.0 56
68 Destroyer Marvel Comics Male - - 188.0 383
69 Overtkill Image Comics Male - - NA NA
70 Garbage Man DC Comics Male - Mutant NA NA
71 Chamber Marvel Comics Male brown Mutant 175.0 63
72 Batman II DC Comics Male blue Human 178.0 79
73 Sabretooth Marvel Comics Male amber Mutant 198.0 171
74 Wyatt Wingfoot Marvel Comics Male brown - 196.0 117
75 Namorita Marvel Comics Female blue - 168.0 101
76 Mister Mxyzptlk DC Comics Male - God / Eternal NA NA
77 Bane DC Comics Male - Human 203.0 180
78 Metron DC Comics <NA> <NA> <NA> NA NA
79 Red Arrow DC Comics Male green Human 180.0 83
80 Allan Quatermain Wildstorm Male - - NA NA
81 Swarm Marvel Comics Male yellow Mutant 196.0 47
82 Steel DC Comics Male brown - 201.0 131
83 Hawkwoman III DC Comics Female blue - 170.0 65
84 Blizzard Marvel Comics Male - - NA NA
85 Blizzard Marvel Comics Male - - NA NA
86 Warlock Marvel Comics Male red - 188.0 108
87 Black Knight III Marvel Comics <NA> <NA> <NA> NA NA
88 Deadshot DC Comics Male brown Human 185.0 91
89 King Kong Male yellow Animal 30.5 NA
90 Sif Marvel Comics Female blue Asgardian 188.0 191
91 Tempest Marvel Comics Female brown - 163.0 54
92 Captain Planet Marvel Comics Male red God / Eternal NA NA
93 Beetle Marvel Comics Male - - NA NA
94 Professor X Marvel Comics Male blue Mutant 183.0 86
95 Blue Beetle III DC Comics Male brown Human NA NA
96 Fixer Marvel Comics - red - NA NA
97 Medusa Marvel Comics Female green Inhuman 180.0 59
98 Clock King DC Comics Male blue Human 178.0 78
99 Cat Marvel Comics Female blue - 173.0 61
100 Morph Marvel Comics Male white - 178.0 79
101 Donatello IDW Publishing Male green Mutant NA NA
102 T-1000 Dark Horse Comics Male - Android 183.0 146
103 Flash II DC Comics <NA> <NA> <NA> NA NA
104 Man of Miracles Image Comics - blue God / Eternal NA NA
105 Thanos Marvel Comics Male red Eternal 201.0 443
106 Exodus Marvel Comics Male blue Mutant 183.0 88
107 Ultragirl Marvel Comics <NA> <NA> <NA> NA NA
108 Archangel Marvel Comics Male blue Mutant 183.0 68
109 Ra's Al Ghul DC Comics Male green Human 193.0 97
110 Nova Marvel Comics Male brown Human 185.0 86
111 Yellowjacket II Marvel Comics Female blue Human 165.0 52
112 Micah Sanders NBC - Heroes Male brown - NA NA
113 Silk Spectre II DC Comics Female - - NA NA
114 Jigsaw Marvel Comics Male blue - 188.0 113
115 Ms Marvel II Marvel Comics Female blue - 173.0 61
116 Leech Marvel Comics Male - - NA NA
117 Junkpile Marvel Comics Male - Mutant NA NA
118 Husk Marvel Comics <NA> <NA> <NA> NA NA
119 Silk Marvel Comics Female brown Human NA NA
120 Bumblebee DC Comics Female brown Human 170.0 59
121 Mephisto Marvel Comics Male white - 198.0 140
122 Yoda George Lucas Male brown Yoda's species 66.0 17
123 Thunderbird III Marvel Comics <NA> <NA> <NA> NA NA
124 Mister Fantastic Marvel Comics Male brown Human / Radiation 185.0 81
125 Cottonmouth Marvel Comics Male brown Human 183.0 99
126 Mantis Marvel Comics Female green Human-Kree 168.0 52
127 Nebula Marvel Comics Female blue Luphomoid 185.0 83
128 Question DC Comics Male blue Human 188.0 83
129 Stacy X Marvel Comics Female - - NA NA
130 Deathlok Marvel Comics Male brown Cyborg 193.0 178
131 Quill Marvel Comics Male brown - 163.0 56
132 Songbird Marvel Comics Female green - 165.0 65
133 Gravity Marvel Comics Male blue Human 178.0 79
134 Hobgoblin Marvel Comics Male blue - 180.0 83
135 Cameron Hicks SyFy Male - Alpha NA NA
136 Bill Harken SyFy Male - Alpha NA NA
137 Big Man Marvel Comics Male blue - 165.0 71
138 Fabian Cortez Marvel Comics - blue - 196.0 96
139 Ymir Marvel Comics <NA> <NA> <NA> NA NA
140 Ghost Rider II Marvel Comics - - - NA NA
141 Mandarin Marvel Comics Male blue Human 188.0 97
142 Ant-Man Marvel Comics Male blue Human 211.0 122
[ reached 'max' / getOption("max.print") -- omitted 547 rows ]
Count the rows:
left_join(heroes_meta, heroes_personal) %>% count()
n
1 689
left_join(heroes_personal, heroes_meta) %>% count()
n
1 691
We get slightly different numbers depending on which order we join the data in.
This is because left_join keeps every row from the left hand side, but only keeps rows from the right hand dataset IF they have a match in the left hand side.
So if the two datasets don't contain exactly the same cases, the number of rows in the result depends on which dataset you put first.
Other functions like inner_join and full_join take different approaches. Type ?dplyr::inner_join for an explanation.
This would export a csv file to the same location as your rmd file:
left_join(heroes_meta, heroes_personal) %>% rio::export('combined.csv')
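The exercise above asked for write_csv(); an equivalent using that function (from readr, which is loaded as part of tidyverse) would be:

# save the combined dataset as combined.csv next to your rmd file
left_join(heroes_meta, heroes_personal) %>% write_csv('combined.csv')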