Overview
This worksheet will make most sense if you caught the mini-lecture at the start. If you are coming to this later, try to catch up with the recording of that part on Panopto.
Loose ends (in R)
How to report regression
Practice task
Loose ends
To present our regression models, I covered a few loose ends in the workshop. In particular we talked about:
- using summarise to calculate summary statistics
- confidence intervals
- standardized coefficients
- printing numbers nicely (avoiding scientific notation)
The following section documents the code needed for those techniques:
Calculating summary statistics
We have already covered this in some detail in session 3, on grouped data, so some of this code should be familiar.
We already know how to calculate summary statistics like this:
studyhabits %>%
  summarise(mean(work_hours), mean(work_consistently), mean(grade))
mean(work_hours) mean(work_consistently) mean(grade)
1 24.87333 3.333333 59.20667
In the workshop I showed a short-cut technique to summarise multiple variables at once. This isn’t strictly necessary, but can be a time saver.
studyhabits %>%
  select(-unique_id) %>%
  summarise_all(list(Mean=mean, SD=sd)) %>%
  pivot_longer(everything()) %>%
  arrange(name)
# A tibble: 16 × 2
name value
<chr> <dbl>
1 female_Mean 0.827
2 female_SD 0.379
3 focus_deadline_Mean 2.85
4 focus_deadline_SD 0.876
5 grade_Mean 59.2
6 grade_SD 9.57
7 msc_student_Mean 0.47
8 msc_student_SD 0.500
9 progress_everyday_Mean 2.79
10 progress_everyday_SD 0.805
11 revision_before_Mean 3.36
12 revision_before_SD 0.853
13 work_consistently_Mean 3.33
14 work_consistently_SD 0.851
15 work_hours_Mean 24.9
16 work_hours_SD 4.90
Explanation of the code: We can break this down, line by line:
- First we take the studyhabits data, and remove the unique_id column using the select function.
- Next we use the summarise_all function to calculate statistics on all of these variables in one go.
- The part which says list(Mean=mean, SD=sd) tells R which functions to use. We could add the median by writing list(Mean=mean, Median=median, SD=sd).
- pivot_longer(everything()) reshapes the result from a very thin/wide table to a long-form one.
- arrange(name) sorts the result by variable name.
- Try running each step of the code for yourself. See what happens at each stage.
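For example, adding the median only requires extending the list of functions. A minimal sketch, assuming the tidyverse and the studyhabits data (e.g. from the psydata package, as in previous sessions) are loaded:

```r
# Extending the summary to include the median as well as the mean and SD.
# Assumes library(tidyverse) and the studyhabits data are available.
studyhabits %>%
  select(-unique_id) %>%
  summarise_all(list(Mean = mean, Median = median, SD = sd)) %>%
  pivot_longer(everything()) %>%
  arrange(name)
```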
Another trick which can be useful is counting the number of responses to each option of a question. For example:
studyhabits %>%
  group_by(work_consistently) %>%
  count()
# A tibble: 5 × 2
# Groups: work_consistently [5]
work_consistently n
<dbl> <int>
1 1 4
2 2 37
3 3 140
4 4 93
5 5 26
Explanation of the code: We group_by
the different ‘levels’ or options of the work_consistently
question. The options ranged from 1-5, so the count()
function then tells us how many of each response there were.
Try this for yourself
Adjust the group_by to see the figures for men and women separately.
This code breaks down the table by gender:
studyhabits %>%
  group_by(female, work_consistently) %>%
  count()
# A tibble: 10 × 3
# Groups: female, work_consistently [10]
female work_consistently n
<lgl> <dbl> <int>
1 FALSE 1 1
2 FALSE 2 7
3 FALSE 3 16
4 FALSE 4 19
5 FALSE 5 9
6 TRUE 1 3
7 TRUE 2 30
8 TRUE 3 124
9 TRUE 4 74
10 TRUE 5 17
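If you prefer a compact cross-tabulation to the long table above, one option (a sketch, using pivot_wider from tidyr) is to spread the counts for men and women into separate columns:

```r
# Reshape the grouped counts into one column per value of `female`.
# Assumes library(tidyverse) and the studyhabits data are available.
studyhabits %>%
  group_by(female, work_consistently) %>%
  count() %>%
  ungroup() %>%
  pivot_wider(names_from = female, values_from = n)
```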
Confidence intervals
Once we have run a regression model, we may want confidence intervals
for each coefficient. We can obtain these with tidy()
in
the broom
package:
library(broom)
m1 <- lm(grade ~ work_hours, data=studyhabits)
m1 %>% tidy(conf.int=TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 38.4 2.59 14.8 1.55e-37 33.3 43.5
2 work_hours 0.838 0.102 8.19 7.68e-15 0.636 1.04
Explanation of the output:
- We add the conf.int=TRUE option to include confidence intervals.
- The estimate column is the coefficient value, as you would see in the normal lm output.
- The columns conf.low and conf.high represent the lower and upper bounds of the confidence interval for each coefficient.
- Ignore the other columns (p.value etc).
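If you prefer not to load broom, base R's confint() function reports the same intervals. A quick sketch, assuming m1 is the model fitted above:

```r
# Base-R alternative to tidy(conf.int=TRUE).
# Assumes m1 <- lm(grade ~ work_hours, data=studyhabits) has been run.
confint(m1)                # 95% intervals by default
confint(m1, level = 0.99)  # or choose a different confidence level
```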
Standardized coefficients
As discussed in the introduction, standardizing coefficients can be helpful to enable us to compare:
- Different predictors in the same model, and
- The same predictors in different models (e.g. when the outcome measure changes)
The easiest way to standardize the coefficients from the model is to
use the arm::standardize()
function (in the
arm
package).
First we fit the model as usual:
m1 <- lm(grade ~ work_hours, data=studyhabits)
Then we have 2 options:
- Standardize the predictor variables only
- Standardize the predictors AND the outcome variable
Option 1: Standardize only the predictors
This code refits our model, but standardizes the predictor variable:
m1 %>% arm::standardize()
Call:
lm(formula = grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
59.207 8.208
Explanation of the output: The output of our model
changes subtly: Instead of work_hours
our coefficient is
now called z.work_hours
.
Explanation of the standardized coefficients marked with “z.”:
- We can interpret the value, \(8.208\), as the change in grade % we would see for a 2 standard-deviation change in work_hours (by default, arm::standardize rescales continuous predictors by dividing by 2 standard deviations, following Gelman's recommendation).
- The standard deviation of work_hours is 4.9.
- This means that we expect a grade increase of 8.2% for every 9.8-hour (i.e. 2 SD) increase in work.
- Our predictions remain the same — only the units the coefficients are expressed in have changed
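We can check where this number comes from by hand: because arm::standardize divides continuous predictors by 2 standard deviations by default, multiplying the raw slope by twice the SD of work_hours should reproduce the z.work_hours coefficient. A sketch, assuming the studyhabits data are loaded:

```r
# Reproduce the z.work_hours coefficient by hand.
# arm::standardize divides predictors by 2 SDs by default,
# so raw slope * 2 * SD(work_hours) should match it.
m1 <- lm(grade ~ work_hours, data = studyhabits)
coef(m1)["work_hours"] * 2 * sd(studyhabits$work_hours)  # approx 8.21
```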
Option 2: Standardize both the predictors AND the outcome
This code refits our model, but standardizes both the predictor variable AND the outcome:
m1 %>% arm::standardize(standardize.y=TRUE)
Call:
lm(formula = z.grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
-1.281e-16 4.287e-01
Explanation of the output: The output of our model
again changes subtly: This time we have both z.work_hours
and z.grade
named in the model. The z
is a
reference to ‘z-scores’, which is another term for standardized
scores.
Explanation of the standardized coefficients:
- We can interpret the value 4.287e-01 as \(0.4287\) (it was in scientific notation).
- This is the number of standard-deviation changes in grade we expect for a 1 standard-deviation change in work_hours.
- The SD of work_hours is 4.9.
- The SD of grade is 9.57.
- This means that we expect a grade increase of \(9.57 \times 0.4287\) for every 4.9-hour increase in work.
Which type of standardizing is better?
Neither is better/worse. In psychology it’s common to standardize both predictors and outcomes, and to report these standardized coefficients as ‘betas’, or use the Greek symbol β.
Printing numbers nicely (avoiding scientific notation)
In calculating the standardized coefficients, R reported the answers using scientific notation. This can be annoying.
To fix this we can use the scipen
option:
options(scipen=20)
m1 %>%
  arm::standardize(standardize.y=TRUE)
Call:
lm(formula = z.grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
-0.0000000000000001281 0.4286972460463128742
Explanation of the option name: scipen is short for
scientific notation penalty. It controls how reluctant R is
to use scientific notation: fixed notation is preferred unless it would
be more than scipen characters wider than the scientific
equivalent. Setting scipen=20 therefore stops R using the
e+20 type notation unless a number is really very small or
very large.
This change persists, so you may also want to set R back to normal afterwards:
options(scipen=0)
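An alternative which avoids changing a global option is to format individual numbers as needed, e.g. with round() or format(). A small sketch:

```r
# Formatting a single number without touching options()
x <- 0.4286972460463129
round(x, 3)                                # 0.429
format(x, digits = 3, scientific = FALSE)  # prints without e-notation
```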
The reporting “formula”
Scientific journals are extremely formulaic; study analyses and results are reported in a very consistent way. This makes it relatively easy to know what to put where in your report.
For example, the journal Psychological Science publishes detailed guidance on the content of each section of manuscripts acceptable for publication. There are small differences between journals but you will always need:
- A statement of your hypotheses [in the introduction]
- A summary of your analytic approach [methods]
- A description of the study sample [results]
- An exact description of the analyses you ran [results]
- A table describing the outputs of the model [results]
- A description of the direction/pattern of the effects found [results]
- A re-statement of what the main finding was [discussion]
This document regression-reporting-example-studyhabits.pdf contains a simplified worked example, along with an explanation of the different parts of the report.
- Read the regression-reporting-example-studyhabits.pdf document.
- Work with a partner to discuss the document.
- Collect all the R code you would need to calculate each of the numbers in this document (it should all be contained in this or previous worksheets). You might also like to use the cheat-sheets/reference guides here.
- Make a new Rmd file for this code. Include comments and extra text to explain each part.
- Try pressing the knit button to convert your Rmd to an html file. You may need to debug errors in the code to make this work, so ask the TARAs for support here if needed.
Practice
Use the following variables from the mentalh
dataset in
the psydata
package
mentalh %>%
  select(anxiety_score, tv, computer) %>%
  glimpse()
Rows: 528
Columns: 3
$ anxiety_score <dbl> 7, 10, 13, 13, 3, 2, 1, 6, 1, 10, 19, 4, 7, 9, 9, 8, 7, …
$ tv <dbl> 1.2142860, 0.7857143, 1.5000000, 4.5000000, 8.5714280, 0…
$ computer <dbl> 1.3571430, 0.6428572, 2.7142860, 2.7857140, 10.0000000, …
Run analyses to answer the question: “do time spent on the computer and time spent watching TV both predict anxiety scores, in this sample?”
Report your results in a format which matches the example provided as closely as possible.
- Document all of your calculations in a new Rmd document.
- Include comments and explanations in this document as needed. Imagine yourself returning to the project in 6 months’ time: would it be easy to pick up where you left off? If not, help your future self out!