Correlations

The base R cor() function provides a simple way to get Pearson correlations, but to get a correlation matrix as you might expect from SPSS or Stata it’s best to use the corr.test() function in the psych package.

Before you start though, plotting the correlations might be the best way of getting to grips with the patterns of relationship in your data. A pairs plot is a nice way of doing this:

airquality %>%
  select(-Month, -Day) %>%
  pairs

If we were satisfied the relationships were (reasonably) linear, we could also visualise correlations themselves with a ‘corrgram’, using the corrgram library:

library("corrgram")
Registered S3 method overwritten by 'seriation':
  method         from 
  reorder.hclust gclus
airquality %>%
  select(-Month, -Day) %>%
  corrgram(lower.panel=corrgram::panel.ellipse,
         upper.panel=panel.cor,
         diag.panel=panel.density)

A corrgram, showing pearson correlations (above the diagonal), variable distributions (on the diagonal) and ellipses and smoothed lines of best fit (below the diagnonal). Long, narrow ellipses denote large correlations; circular ellipses indicate small correlations.

Figure 5.1: A corrgram, showing pearson correlations (above the diagonal), variable distributions (on the diagonal) and ellipses and smoothed lines of best fit (below the diagnonal). Long, narrow ellipses denote large correlations; circular ellipses indicate small correlations.

The ggpairs function from the GGally package is also a nice way of plotting relationships between a combination of categorical and continuous data - it packs a lot of information into a limited space:

mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  select(mpg, wt, drat, cyl) %>%
  GGally::ggpairs()

Creating a correlation matrix

The psych::corr.test() function is a quick way to obtain a pairwise correlation matrix for an entire dataset, along with p values and confidence intervals which the base R cor() function will not provide:

mycorrelations <- psych::corr.test(airquality)
mycorrelations
Call:psych::corr.test(x = airquality)
Correlation matrix 
        Ozone Solar.R  Wind  Temp Month   Day
Ozone    1.00    0.35 -0.60  0.70  0.16 -0.01
Solar.R  0.35    1.00 -0.06  0.28 -0.08 -0.15
Wind    -0.60   -0.06  1.00 -0.46 -0.18  0.03
Temp     0.70    0.28 -0.46  1.00  0.42 -0.13
Month    0.16   -0.08 -0.18  0.42  1.00 -0.01
Day     -0.01   -0.15  0.03 -0.13 -0.01  1.00
Sample Size 
        Ozone Solar.R Wind Temp Month Day
Ozone     116     111  116  116   116 116
Solar.R   111     146  146  146   146 146
Wind      116     146  153  153   153 153
Temp      116     146  153  153   153 153
Month     116     146  153  153   153 153
Day       116     146  153  153   153 153
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
        Ozone Solar.R Wind Temp Month  Day
Ozone    0.00    0.00 0.00 0.00  0.56 1.00
Solar.R  0.00    0.00 1.00 0.01  1.00 0.56
Wind     0.00    0.50 0.00 0.00  0.25 1.00
Temp     0.00    0.00 0.00 0.00  0.00 0.65
Month    0.08    0.37 0.03 0.00  0.00 1.00
Day      0.89    0.07 0.74 0.11  0.92 0.00

 To see confidence intervals of the correlations, print with the short=FALSE option

One thing to be aware of is that by default corr.test() produces p values that are adjusted for multiple comparisons in the top right hand triangle (i.e. above the diagonal). If you want the uncorrected values use the values below the diagonal (or pass adjust=FALSE when calling the function).

Working with correlation matrices

It’s important to realise that, as with all R objects, we can work with correlation matrices to continue our data ananalyses.

For example, as part of exploring your data, you might want to know whether correlations you observe in one sample are similar to those from another sample, when using the same questions. For example, let’s say we ran a survey measuring variables from the theory of planned behaviour first in students, and later in older adults:

We could run correlations for each sample separately:

corr.students <- cor(students)
corr.public <- cor(public)

And we could ‘eyeball’ both of these correlation matrices and try and spot patterns or differences between them, but this is quite hard:

corr.students %>%
  pander()

	behaviour	intention	control	social.norm	attitude
behaviour	1.000	0.55	0.648	0.072	0.132
intention	0.551	1.00	0.394	0.298	0.438
control	0.648	0.39	1.000	0.017	0.014
social.norm	0.072	0.30	0.017	1.000	-0.011
attitude	0.132	0.44	0.014	-0.011	1.000

corr.public %>%
  pander

	behaviour	intention	social.norm	attitude	control
behaviour	1.00	0.54	0.287	0.143	0.23
intention	0.54	1.00	0.361	0.290	0.37
social.norm	0.29	0.36	1.000	0.019	-0.08
attitude	0.14	0.29	0.019	1.000	0.04
control	0.23	0.37	-0.080	0.040	1.00

But we could also simply subtract one matrix from the other to show the difference directly:

(corr.students - corr.public) %>%
  pander()

	behaviour	intention	control	social.norm	attitude
behaviour	0.000	0.013	0.361	-0.071	-0.093
intention	0.013	0.000	0.033	0.008	0.067
control	0.361	0.033	0.000	-0.002	0.093
social.norm	-0.071	0.008	-0.002	0.000	-0.050
attitude	-0.093	0.067	0.093	-0.050	0.000

Now it’s much more obvious that the behaviour/control correlation differs between the samples (it’s higher in the students).

The point here is not that this is an analysis you are likely to actually report — although you might find it useful when exploring the data and interpreting your findings.

But rather this show that a correlation matrix, in common with the results of all the statistical tests we run, are themselves just data points. We can do whatever we like with our results — storing them in data frames to display later, or process as we need.

In reality, if you wanted to test the difference in correlations (slopes) in two groups for one outcome variable you probably want to use multiple regression, and if you wanted to test a complex model like the theory of planned behaviour, you might consider CFA and/or SEM).

Tables for publication

Using `apaTables`

If you want to produce nice correlation tables for publication the apaTables package might be useful. This block saves an APA formatted correlation table to an external Word document like this.

Note though, that the APA table format does encourage ‘star gazing’ to some degree. Try to avoid interpreting correlation tables solely based on the significance (or not) of the r values. The pairs or corrgram plots shown above are a much better summary of the data, and are can be just as compact.

library(apaTables)
apa.cor.table(airquality, filename="Table1_APA.doc", show.conf.interval=F)
The ability to suppress reporting of reporting confidence intervals has been deprecated in this version.
The function argument show.conf.interval will be removed in a later version.


Means, standard deviations, and correlations with confidence intervals
 

  Variable   M      SD    1            2           3           
  1. Ozone   42.13  32.99                                      
                                                               
  2. Solar.R 185.93 90.06 .35**                                
                          [.17, .50]                           
                                                               
  3. Wind    9.96   3.52  -.60**       -.06                    
                          [-.71, -.47] [-.22, .11]             
                                                               
  4. Temp    77.88  9.47  .70**        .28**       -.46**      
                          [.59, .78]   [.12, .42]  [-.57, -.32]
                                                               
  5. Month   6.99   1.42  .16          -.08        -.18*       
                          [-.02, .34]  [-.23, .09] [-.33, -.02]
                                                               
  6. Day     15.80  8.86  -.01         -.15        .03         
                          [-.20, .17]  [-.31, .01] [-.13, .19] 
                                                               
  4           5          
                         
                         
                         
                         
                         
                         
                         
                         
                         
                         
                         
  .42**                  
  [.28, .54]             
                         
  -.13        -.01       
  [-.28, .03] [-.17, .15]
                         

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
* indicates p < .05. ** indicates p < .01.

By hand

If you’re not bothered about strict APA format, you might still want to extract r and p values as dataframes which can then be saved to a csv and opened in Excel, or converted to a table some other way.

You can do this by storing the corr.test output in a variable, and the accessing the $r and $p values within it.

First, we create the corr.test object:

mycorrelations <- psych::corr.test(airquality)

Then extract the r values as a table:

mycorrelations$r %>%
  pander()

	Ozone	Solar.R	Wind	Temp	Month	Day
Ozone	1.000	0.348	-0.602	0.70	0.165	-0.013
Solar.R	0.348	1.000	-0.057	0.28	-0.075	-0.150
Wind	-0.602	-0.057	1.000	-0.46	-0.178	0.027
Temp	0.698	0.276	-0.458	1.00	0.421	-0.131
Month	0.165	-0.075	-0.178	0.42	1.000	-0.008
Day	-0.013	-0.150	0.027	-0.13	-0.008	1.000

And we can also extract p values:

mycorrelations$p %>%
  pander()

	Ozone	Solar.R	Wind	Temp	Month	Day
Ozone	0.000	0.002	0.000	0.000	0.56	1.00
Solar.R	0.000	0.000	1.000	0.008	1.00	0.56
Wind	0.000	0.496	0.000	0.000	0.25	1.00
Temp	0.000	0.001	0.000	0.000	0.00	0.65
Month	0.078	0.366	0.027	0.000	0.00	1.00
Day	0.888	0.070	0.739	0.108	0.92	0.00

Saving as a .csv is the same as for other dataframes:

write.csv(mycorrelations$r, file="airquality-r-values.csv")

And can also access the CI for each pariwise correlation as a table:

mycorrelations$ci %>%
  head() %>%
  pander(caption="First 6 rows of the table of CI's for the correlation matrix.")

First 6 rows of the table of CI’s for the correlation matrix.
	lower	r	upper	p
Ozone-Slr.R	0.173	0.348	0.50	0.000
Ozone-Wind	-0.706	-0.602	-0.47	0.000
Ozone-Temp	0.591	0.698	0.78	0.000
Ozone-Month	-0.018	0.165	0.34	0.078
Ozone-Day	-0.195	-0.013	0.17	0.888
Slr.R-Wind	-0.217	-0.057	0.11	0.496

Other methods for correlation

By default corr.test produces Pearson correlations, but You can pass the method argument psych::corr.test():

psych::corr.test(airquality, method="spearman")
psych::corr.test(airquality, method="kendall")