Overview
This worksheet will make most sense if you caught the mini-lecture at the start. If you are coming to this later, try to catch up with the recording of that part on Panopto.
Loose ends (in R)
How to report regression
Practice task
Loose ends
To present our regression models, I covered a few loose ends in the workshop. In particular we talked about:
- using summarise to calculate summary statistics
- confidence intervals
- standardized coefficients
- printing numbers nicely (avoiding scientific notation)
The following section documents the code needed for those techniques:
Calculating summary statistics
We have already covered this in some detail in session 3, on grouped data, so some of this code should be familiar.
We already know how to calculate summary statistics like this:
studyhabits %>%
  summarise(mean(work_hours), mean(work_consistently), mean(grade))
mean(work_hours) mean(work_consistently) mean(grade)
1 24.87333 3.333333 59.20667
In the workshop I showed a short-cut technique to summarise multiple variables at once. This isn’t strictly necessary, but can be a time saver.
studyhabits %>%
  select(-unique_id) %>%
  summarise_all(list(Mean=mean, SD=sd)) %>%
  pivot_longer(everything()) %>%
  arrange(name)
# A tibble: 16 × 2
name value
<chr> <dbl>
1 female_Mean 0.827
2 female_SD 0.379
3 focus_deadline_Mean 2.85
4 focus_deadline_SD 0.876
5 grade_Mean 59.2
6 grade_SD 9.57
7 msc_student_Mean 0.47
8 msc_student_SD 0.500
9 progress_everyday_Mean 2.79
10 progress_everyday_SD 0.805
11 revision_before_Mean 3.36
12 revision_before_SD 0.853
13 work_consistently_Mean 3.33
14 work_consistently_SD 0.851
15 work_hours_Mean 24.9
16 work_hours_SD 4.90
Explanation of the code: We can break this down, line by line:
- First we take the studyhabits data, and remove the unique_id column using the select function.
- Next we use the summarise_all function to calculate statistics on all of these variables in one go.
- The part which says list(Mean=mean, SD=sd) tells R which functions to use. We could add the median by writing list(Mean=mean, Median=median, SD=sd).
- pivot_longer(everything()) reshapes the result from a very thin/wide table to a long-form one.
- arrange(name) sorts the result by variable name.
- Try running each step of the code for yourself. See what happens at each stage.
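For example, adding the median only requires extending the list of functions. A minimal sketch, assuming the tidyverse and the studyhabits data (e.g. from the psydata package, as in previous sessions) are loaded:

```r
# Extending the summary to include the median as well as the mean and SD.
# Assumes library(tidyverse) and the studyhabits data are available.
studyhabits %>%
  select(-unique_id) %>%
  summarise_all(list(Mean = mean, Median = median, SD = sd)) %>%
  pivot_longer(everything()) %>%
  arrange(name)
```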
Another trick which can be useful is counting the number of responses to each option of a question. For example:
studyhabits %>%
  group_by(work_consistently) %>%
  count()
# A tibble: 5 × 2
# Groups: work_consistently [5]
work_consistently n
<dbl> <int>
1 1 4
2 2 37
3 3 140
4 4 93
5 5 26
Explanation of the code: We group_by
the different ‘levels’ or options of the work_consistently
question. The options ranged from 1-5, so the count()
function then tells us how many of each response there were.
Try this for yourself
Adjust the group_by to see the figures for men and women separately.
This code breaks down the table by gender:
studyhabits %>%
  group_by(female, work_consistently) %>%
  count()
# A tibble: 10 × 3
# Groups: female, work_consistently [10]
female work_consistently n
<lgl> <dbl> <int>
1 FALSE 1 1
2 FALSE 2 7
3 FALSE 3 16
4 FALSE 4 19
5 FALSE 5 9
6 TRUE 1 3
7 TRUE 2 30
8 TRUE 3 124
9 TRUE 4 74
10 TRUE 5 17
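If you prefer a compact cross-tabulation to the long table above, one option (a sketch, using pivot_wider from tidyr) is to spread the counts for men and women into separate columns:

```r
# Reshape the grouped counts into one column per value of `female`.
# Assumes library(tidyverse) and the studyhabits data are available.
studyhabits %>%
  group_by(female, work_consistently) %>%
  count() %>%
  ungroup() %>%
  pivot_wider(names_from = female, values_from = n)
```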
Confidence intervals
Once we have run a regression model, we may want confidence intervals
for each coefficient. We can obtain these with tidy()
in
the broom
package:
library(broom)
m1 <- lm(grade ~ work_hours, data=studyhabits)
m1 %>% tidy(conf.int=TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 38.4 2.59 14.8 1.55e-37 33.3 43.5
2 work_hours 0.838 0.102 8.19 7.68e-15 0.636 1.04
Explanation of the output:
- We add the conf.int=TRUE option to include confidence intervals.
- The estimate column is the coefficient value, as you would see in the normal lm output.
- The columns conf.low and conf.high represent the lower and upper bounds of the confidence interval for each coefficient.
- Ignore the other columns (p.value etc).
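If you prefer not to load broom, base R's confint() function reports the same intervals. A quick sketch, assuming m1 is the model fitted above:

```r
# Base-R alternative to tidy(conf.int=TRUE).
# Assumes m1 <- lm(grade ~ work_hours, data=studyhabits) has been run.
confint(m1)                # 95% intervals by default
confint(m1, level = 0.99)  # or choose a different confidence level
```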
Standardized coefficients
As discussed in the introduction, standardizing coefficients can be helpful to enable us to compare:
- Different predictors in the same model, and
- The same predictors in different models (e.g. when the outcome measure changes)
The easiest way to standardize the coefficients from the model is to
use the arm::standardize()
function (in the
arm
package).
First we fit the model as usual:
m1 <- lm(grade ~ work_hours, data=studyhabits)
Then we have 2 options:
- Standardize the predictor variables only
- Standardize the predictors AND the outcome variable
Option 1: Standardize only the predictors
This code refits our model, but standardizes the predictor variable:
m1 %>% arm::standardize()
Call:
lm(formula = grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
59.207 8.208
Explanation of the output: The output of our model
changes subtly: Instead of work_hours
our coefficient is
now called z.work_hours
.
Explanation of the standardized coefficients marked with “z.”:
- We can interpret the value, \(8.208\), as the change in grade % we would see for a 2 standard-deviation change in work_hours (by default, arm::standardize rescales continuous predictors by dividing by 2 standard deviations, following Gelman's recommendation).
- The standard deviation of work_hours is 4.9.
- This means that we expect a grade increase of 8.2% for every 9.8-hour (i.e. 2 SD) increase in work.
- Our predictions remain the same — only the units the coefficients are expressed in have changed
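We can check where this number comes from by hand: because arm::standardize divides continuous predictors by 2 standard deviations by default, multiplying the raw slope by twice the SD of work_hours should reproduce the z.work_hours coefficient. A sketch, assuming the studyhabits data are loaded:

```r
# Reproduce the z.work_hours coefficient by hand.
# arm::standardize divides predictors by 2 SDs by default,
# so raw slope * 2 * SD(work_hours) should match it.
m1 <- lm(grade ~ work_hours, data = studyhabits)
coef(m1)["work_hours"] * 2 * sd(studyhabits$work_hours)  # approx 8.21
```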
Option 2: Standardize both the predictors AND the outcome
This code refits our model, but standardizes both the predictor variable AND the outcome:
m1 %>% arm::standardize(standardize.y=TRUE)
Call:
lm(formula = z.grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
-1.281e-16 4.287e-01
Explanation of the output: The output of our model
again changes subtly: This time we have both z.work_hours
and z.grade
named in the model. The z
is a
reference to ‘z-scores’, which is another term for standardized
scores.
Explanation of the standardized coefficients:
- We can interpret the value 4.287e-01 as \(0.4287\) (it was in scientific notation).
- This is the number of standard-deviation changes in grade we expect for a 1 standard-deviation change in work_hours.
- The SD of work_hours is 4.9.
- The SD of grade is 9.57.
- This means that we expect a grade increase of \(9.57 \times 0.4287\) for every 4.9-hour increase in work.
Which type of standardizing is better?
Neither is better/worse. In psychology it’s common to standardize both predictors and outcomes, and to report these standardized coefficients as ‘betas’, or use the Greek symbol β.
Printing numbers nicely (avoiding scientific notation)
In calculating the standardized coefficients, R reported the answers using scientific notation. This can be annoying.
To fix this we can use the scipen
option:
options(scipen=20)
m1 %>%
  arm::standardize(standardize.y=TRUE)
Call:
lm(formula = z.grade ~ z.work_hours, data = studyhabits)
Coefficients:
(Intercept) z.work_hours
-0.0000000000000001281 0.4286972460463128742
Explanation of the option name: scipen is short for
scientific notation penalty. It controls how reluctant R is
to use scientific notation: fixed notation is preferred unless it would
be more than scipen characters wider than the scientific
equivalent. Setting scipen=20 therefore stops R using the
e+20 type notation unless a number is really very small or
very large.
This change persists, so you may also want to set R back to normal afterwards:
options(scipen=0)
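An alternative which avoids changing a global option is to format individual numbers as needed, e.g. with round() or format(). A small sketch:

```r
# Formatting a single number without touching options()
x <- 0.4286972460463129
round(x, 3)                                # 0.429
format(x, digits = 3, scientific = FALSE)  # prints without e-notation
```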
The reporting “formula”
Scientific journals are extremely formulaic; study analyses and results are reported in a very consistent way. This makes it relatively easy to know what to put where in your report.
For example, the journal Psychological Science publishes detailed guidance on the content of each section of manuscripts acceptable for publication. There are small differences between journals but you will always need:
- A statement of your hypotheses [in the introduction]
- A summary of your analytic approach [methods]
- A description of the study sample [results]
- An exact description of the analyses you ran [results]
- A table describing the outputs of the model [results]
- A description of the direction/pattern of the effects found [results]
- A re-statement of what the main finding was [discussion]
This document regression-reporting-example-studyhabits.pdf contains a simplified worked example, along with an explanation of the different parts of the report.
- Read the regression-reporting-example-studyhabits.pdf document.
- Work with a partner to discuss the document.
- Collect all the R code you would need to calculate each of the numbers in this document (it should all be contained in this or previous worksheets). You might also like to use the cheat-sheets/reference guides here.
- Make a new Rmd file for this code. Include comments and extra text to explain each part.
- Try pressing the knit button to convert your Rmd to an html file. You may need to debug errors in the code to make this work, so ask the TARAs for support here if needed.
Practice
Use the following variables from the mentalh
dataset in
the psydata
package
mentalh %>%
  select(anxiety_score, tv, computer) %>%
  glimpse()
Rows: 528
Columns: 3
$ anxiety_score <dbl> 7, 10, 13, 13, 3, 2, 1, 6, 1, 10, 19, 4, 7, 9, 9, 8, 7, …
$ tv <dbl> 1.2142860, 0.7857143, 1.5000000, 4.5000000, 8.5714280, 0…
$ computer <dbl> 1.3571430, 0.6428572, 2.7142860, 2.7857140, 10.0000000, …
Run analyses to answer the question: “do time spent on the computer and time spent watching TV both predict anxiety scores, in this sample?”
Report your results in a format which matches the example provided as closely as possible.
- Document all of your calculations in a new Rmd document.
- Include comments and explanations in this document as needed. Imagine yourself returning to the project in 6 months’ time: would it be easy to pick up where you left off? If not, help your future self out!