In brief

Reporting our results is a key part of the research cycle. We should aim to report findings fairly and set them in a meaningful context. Our goal is not to make results look as impressive as possible: aim instead for a balanced summary of the data which another researcher could (and would want to) replicate from your published report.

An important task is to quantify how much evidence the model provides. In recent years, null-hypothesis significance tests (NHST) have fallen out of favour among statisticians (although there is strong inertia in research practice). Courses at Plymouth explain NHST to enable you to read the existing literature, but we also teach Bayes Factors and other Bayesian techniques, which enable you to quantify the strength of the evidence for, or against, a hypothesis of interest.

Before you start

  • Make a new .rmd file for today, called reporting.rmd

  • If you haven’t already done so, make another new rmd file for your work on the regression assessment. Call it regression-assessment.rmd or similar.

How good?

We said we could ask two questions about how good our regression models are:

  1. How much variance (variability) in the outcome do the predictors explain?
  2. How much more probable is the model which includes our predictors, than a model without them?

To answer question 1 we can check what proportion of variance is explained by the model using \(R^2\) (see the previous session).
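
For example, a minimal sketch, assuming the studyhabits dataset from the psydata package is loaded (the variable name first_model is just illustrative):

# refit the regression from the previous session
first_model <- lm(grade ~ work_hours, data = studyhabits)

# proportion of variance explained (adjusted for the number of predictors)
summary(first_model)$adj.r.squared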

The second question asks how probable different models are. By ‘probable’ we mean:

How likely is it that our model is ‘correct’, given the data we have observed?

For various reasons, it’s very hard to actually calculate that probability for an individual model. We also know that ALL models are a simplification of reality, so their absolute probability is not always that important.

However, it’s much easier to compare two different models, and ask the slightly different question:

How much more likely is my model than some alternative model?

If we choose suitable alternative models for comparison this can be a powerful tool.

For example, we can compare models with and without certain predictors included, and this gives us a sense of how much evidence we have that they help predict the outcome.

This kind of comparison is expressed using a Bayes Factor.

Bayes Factors

Was it worth running our model at all? Maybe our predictors are no good and don’t really help us explain the outcome? A Bayes Factor quantifies how likely that is.

A Bayes Factor can compare the probabilities of different hypotheses if those hypotheses are represented by statistical models, for example:

  • The ‘no effects’ hypothesis, \(H_0\): the predictors are NOT related to the outcome (except by chance), vs.

  • The experimental hypothesis, \(H_1\): the predictors are related to the outcome.

Use the BayesFactor package

As you have done previously, we will use the BayesFactor package for this comparison. First we load the library, ignoring the warning messages it produces:

library(BayesFactor)
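
If you would rather not see those messages at all, one optional approach is to load the package quietly:

# load BayesFactor without printing its startup messages
suppressPackageStartupMessages(library(BayesFactor))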

Next, use the lmBF function (instead of lm) to re-run our multiple regression:

lmBF(grade ~ work_hours * female, data = studyhabits)
Bayes factor analysis
--------------
[1] work_hours * female : 1.084058e+13 ±0%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

Explanation of the command: We replaced lm with the lmBF command. This re-ran our regression model and computed a Bayes Factor.

Explanation of the output:

  • Ignore the warning message about data being ‘coerced’ - it’s not as bad as it sounds.
  • Look for the line which says something like [1] work_hours * female : 1.084058e+13.
  • This is the Bayes Factor for the comparison of your model against a simpler model with no predictors at all. That is, a model which just computes the average of all outcome scores.
  • This is sometimes called the ‘Intercept only’ model.

Interpreting

The output above uses scientific notation, but the Bayes Factor (BF) is VERY large. Written out in full, the BF displayed (1.084058e+13) is 10,840,582,601,481.
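
You can also ask R to write the number out in full for you. For example (value copied from the output above):

# print the Bayes Factor without scientific notation
format(1.084058e+13, scientific = FALSE, big.mark = ",")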

This means you have very strong evidence that your model is better than no model at all. Some conventional thresholds and interpretations are:

  • BF > 3: substantial evidence
  • BF > 10: strong evidence
  • BF > 30: overwhelming evidence

IMPORTANTLY, this only says how much evidence there is that this statistical model is better than no model at all. That’s not a very high bar to pass!

SPECIFICALLY: A BF > 3 doesn’t prove that your psychological model is correct or true. To build a case for that you might want to compare this model with other more complex or ‘reasonable’ alternatives. But that needs to wait for a future module!

If you want, you can also refer back to the stage 1 materials on evidence which have more explanation of Bayes Factors.

  • Using lmBF run a model which has grade as the outcome and work_hours as the predictor. Don’t include female just yet.

  • What is the Bayes Factor for this model, compared with the Intercept only model?

lmBF(grade ~ work_hours, data = studyhabits)
Bayes factor analysis
--------------
[1] work_hours : 726159159273 ±0.01%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

The Bayes Factor is \(7.2615916 \times 10^{11}\), in favour of the hypothesis that adding work_hours as a predictor is helpful in explaining grades.

This is a very large number, so we can be fairly sure work_hours helps predict grades.

Reporting

If you have completed the exercise above you could report your findings by saying something like:

We found strong evidence that a model including work hours predicted grades (adjusted \(R^2\) = 0.15; BF for comparison with the intercept-only model > 10000).
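
To quote the exact numbers in a report, you can pull them out of the result object rather than copying them from the console. A small sketch (extractBF() is part of the BayesFactor package; the variable name bf is illustrative):

# store the result, then extract the BF as a plain number
bf <- lmBF(grade ~ work_hours, data = studyhabits)
extractBF(bf)$bf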

Testing interactions

The BF we just calculated only tested whether work hours was a useful predictor. It didn’t include gender as a predictor.

We can also add female using lmBF:

lmBF(grade ~ work_hours * female, data = studyhabits)
Bayes factor analysis
--------------
[1] work_hours * female : 1.084058e+13 ±0%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

Explanation of the code: We added back female and the interaction of work_hours and female (see the previous session for a reminder about adding multiple predictors).

Explanation of the output:

  • When we add the interaction the Bayes Factor gets bigger: \(1.08 \times 10^{13}\), versus \(1.69 \times 10^{11}\) for the model which includes work_hours and female but not their interaction.

  • This suggests that the model with female and the interaction is more likely than the model without it.


We want to compare these two hypotheses (the sketch below shows how to run both models):

  • \(H_0\): the model including work_hours and a difference for female, but no interaction, is best
  • \(H_1\): the model including work_hours, a difference for female, and their interaction is best

A neat thing about Bayes Factors is that you are not restricted to comparing against the intercept-only model. You can compare any two models (which use the same outcome and dataset) simply by dividing their Bayes Factors:

  • BF10 = \(\dfrac{BF(H_1)}{BF(H_0)}\) (> 3 favours \(H_1\))

  • BF01 = \(\dfrac{BF(H_0)}{BF(H_1)}\) (> 3 favours \(H_0\))

So in this case:

  • BF10 = \(\dfrac{1.084058 \times 10^{13}}{1.687457 \times 10^{11}} = 64.24\)

And

  • BF01 = \(\dfrac{1.687457 \times 10^{11}}{1.084058 \times 10^{13}} = 0.02\)

Explanation of the result: The BF10 is 64.24, which means we have very strong evidence in favour of the interaction.
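
Rather than copying the numbers by hand, you can let R do this division for you. A minimal sketch, assuming studyhabits is loaded (the variable names are illustrative; the same pattern is used again at the end of this session):

# Bayes Factors for the models without and with the interaction
bf_without <- lmBF(grade ~ work_hours + female, data = studyhabits)
bf_with    <- lmBF(grade ~ work_hours * female, data = studyhabits)

# BF10 for the interaction: should be around 64, as calculated above
bf_with / bf_without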

  • Calculate the Bayes Factor for a model which includes work_consistently and female to predict grade, but DOES NOT include their interaction. Call this model \(H_0\)

  • Use the Bayes Factor from the output above for the same model but INCLUDING the interaction. Call this model \(H_1\)

  • Calculate the BF10 for these two models and interpret your result.

We calculate the BF for \(H_0\) like this:

lmBF(grade ~ work_consistently + female, data = studyhabits)
Bayes factor analysis
--------------
[1] work_consistently + female : 2838750 ±0%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

And for \(H_1\) like this:

lmBF(grade ~ work_consistently * female, data = studyhabits)
Bayes factor analysis
--------------
[1] work_consistently * female : 554911.9 ±0.01%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

So, the BF for the model WITHOUT the interaction is \(2.8387498 \times 10^{6}\).

  • The BF for \(H_1\) (the model WITH the interaction), from the output above, is \(5.549119 \times 10^{5}\)

  • We calculate the BF10 by dividing these Bayes Factors:

  • BF10 = \(\dfrac{5.549119 \times 10^{5}}{2.8387498 \times 10^{6}} = 0.195\)

  • BF01 = \(\dfrac{2.8387498 \times 10^{6}}{5.549119 \times 10^{5}} = 5.12\)
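
You can check this arithmetic in R (numbers copied from the outputs above):

554911.9 / 2838750   # BF10, approximately 0.195
2838750 / 554911.9   # BF01, approximately 5.12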

We can interpret these Bayes Factors as substantial evidence AGAINST the interaction (see the evidence thresholds listed above).

Reversing a Bayes Factor

Sometimes you’ll see a Bayes Factor written as BF10, which means the same thing as plain ‘BF’.

Occasionally you will see BF01, which is the same idea but flipped. BF01 is the inverse of BF10: it provides evidence in favour of \(H_0\).

BF01 > 3 means substantial evidence for \(H_0\) (where \(H_0\) is often the ‘null’ hypothesis of no-difference, or no-relationship).


Flipping a BF from a BF01 to a BF10 (or the reverse) is easy. Just calculate:

\(\dfrac{1}{BF}\)


So, if your BF01 = 0.2, then your BF10 would just be \(1/0.2 = 5\)
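
In R this is simple arithmetic, and it also works on stored lmBF results (dividing 1 by a BayesFactor object flips the comparison):

bf01 <- 0.2
1 / bf01   # BF10 = 5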

Among us

Imagine we had data from students who reported how much Among Us they played each day. We could plot the effect of gameplay on their achievement like this:

[Plot: score against daily hours of gameplay, with separate fitted lines (red and blue) for men and women]

As we have seen before, the lines on this plot can be represented by the coefficients in a linear model.

Our model would include a continuous variable to measure hours playing (gameplay), a categorical variable (gender), and the interaction of these two variables:


Call:
lm(formula = score ~ gameplay * gender, data = amongus)

Coefficients:
          (Intercept)               gameplay           genderFemale  gameplay:genderFemale  
               25.403                 -1.703                  2.260                 -2.617  

Explanation of the output:

  • In previous examples we included a variable for gender which was called female and coded 0/1 or FALSE/TRUE for men/women respectively.
  • In this example we have included a variable called gender which has the text "Male" or "Female" instead.
  • The coefficients returned are the same as they would be with a 0/1 coding, but the names of the coefficients are different.
  • For example, genderFemale now relates to the difference in intercepts between men and women, and gameplay:genderFemale now relates to the difference in slopes.

Check you can identify which coefficients in the output above relate to the blue and red slopes in the plot.

(You don’t need to run this model yourself because the coefficients are shown above, but if you want to play with this dataset you can find it in the psydata package; it is called amongus.)

Work out the following values from the coefficients above to check your understanding:

  • The change in score for each unit change in gameplay for men:

  • The estimate for men who play zero hours:

  • The estimate for women who play zero hours:

  • The change in score for each unit change in gameplay for women:

  • The prediction for a man who plays 2.5 hours of the game:

  • The prediction for a woman who plays 2.5 hours of the game:


The coefficients from the model were:

Coefficient             Estimate
(Intercept)                25.40
gameplay                   -1.70
genderFemale                2.26
gameplay:genderFemale      -2.62

This means (predicting for people who played for 2.5 hours):

  • For men, the prediction is 25.4 + 2.5 \(\times\) -1.7 = 21.15
  • For women, the prediction is 25.4 + 2.26 + 2.5 \(\times\) -1.7 + 2.5 \(\times\) -2.62 = 16.86
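
You could check these calculations in R. A minimal sketch, assuming the amongus dataset from psydata is loaded (the variable name m is illustrative; predict() may differ slightly from the hand calculation because the coefficients above are rounded):

# by hand, using the rounded coefficients above
25.4 + 2.5 * -1.7                        # men:   21.15
25.4 + 2.26 + 2.5 * -1.7 + 2.5 * -2.62   # women: 16.86

# or refit the model and let predict() do the work
m <- lm(score ~ gameplay * gender, data = amongus)
predict(m, newdata = data.frame(gameplay = 2.5, gender = c("Male", "Female")))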

Testing the interaction

As we saw above, if you run a model with an interaction term (i.e. the formula has a "*" in it) then you will need to test that interaction.

When we say “test that interaction” we are asking the question:

“does a model that includes an interaction term work better than one which does not?”

or

“do we really need separate slopes for men and women to explain these data?”

Comparing the ‘right’ models

We are interested in whether men and women had different slopes.

That is, does the relationship between gameplay and score differ between men and women?

  • Run two models using lmBF and the amongus dataset in psydata:

    1. Predicting score from gameplay and gender with no interaction
    2. Predicting score from gameplay and gender WITH an interaction
  • Store each model in a variable. Call these h0 and h1

  • On a new line, write h1 / h0 and run this code. The result is the BF10 in favour of the interaction; interpret your finding.

  • On another new line: adjust your code to calculate the BF01 instead. Check you understand how these two numbers are related.

h0 <- lmBF(score ~ gameplay + gender, data = amongus)
h1 <- lmBF(score ~ gameplay * gender, data = amongus)

# calculate the BF10
h1/h0
Bayes factor analysis
--------------
[1] gameplay * gender : 1.917127 ±1.28%

Against denominator:
  score ~ gameplay + gender 
---
Bayes factor type: BFlinearModel, JZS
# calculate the BF01
h0/h1
Bayes factor analysis
--------------
[1] gameplay + gender : 0.5216139 ±1.28%

Against denominator:
  score ~ gameplay * gender 
---
Bayes factor type: BFlinearModel, JZS

The BF10 is only 1.9: some evidence in favour of the interaction, but below the conventional threshold of 3, so the result is inconclusive.