Missing values

Missing values aren’t a data type as such, but are an important concept in R; the way different functions handle missing values can be both helpful and frustrating in equal measure.

Missing values in a vector are denoted by the letters NA, but notice that these letters are unquoted. That is to say, NA is not the same as "NA"!

To check for missing values in a vector (or dataframe column) we use the is.na() function:

nums.with.missing <- c(1, 2, NA)
nums.with.missing
[1]  1  2 NA

is.na(nums.with.missing)
[1] FALSE FALSE  TRUE

Here the is.na() function has tested whether each item in our vector called nums.with.missing is missing. It returns a new vector with the results of each test: either TRUE or FALSE.
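A common slip is to test for missing values with == rather than is.na(). The short sketch below shows why that fails, and also why the unquoted NA differs from the string "NA". The vector name strings is introduced only for illustration:

```r
# NA (unquoted) is R's missing-value marker; "NA" is an ordinary string
strings <- c("a", NA, "NA")
is.na(strings)
# [1] FALSE  TRUE FALSE

# Comparing with == does not work: any comparison with NA is itself NA
nums.with.missing <- c(1, 2, NA)
nums.with.missing == NA
# [1] NA NA NA
```

Because the comparison returns NA for every element, is.na() is the reliable way to test for missingness.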

We can also use the negation operator, the ! symbol, to reverse the meaning of is.na(). So we can read !is.na(nums.with.missing) as “test whether the values in nums.with.missing are NOT missing”:

# test if missing
is.na(nums.with.missing)
[1] FALSE FALSE  TRUE

# test if NOT missing (note the exclamation mark in front of the function)
!is.na(nums.with.missing)
[1]  TRUE  TRUE FALSE
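Because TRUE counts as 1 and FALSE as 0, these logical vectors can also be summed or averaged to count missing values, or used to subset a vector. A minimal sketch (the names n.missing, prop.missing and observed are only illustrative):

```r
nums.with.missing <- c(1, 2, NA)

# Count the missing values (TRUE is treated as 1)
n.missing <- sum(is.na(nums.with.missing))
n.missing
# [1] 1

# Proportion of values which are missing
prop.missing <- mean(is.na(nums.with.missing))

# Keep only the values which are NOT missing
observed <- nums.with.missing[!is.na(nums.with.missing)]
observed
# [1] 1 2
```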

We can use the is.na() function as part of dplyr filters:

airquality %>% 
  filter(is.na(Solar.R)) %>% 
  head(3) %>% 
  pander

Ozone	Solar.R	Wind	Temp	Month	Day
NA	NA	14.3	56	5	5
28	NA	14.9	66	5	6
7	NA	6.9	74	5	11

Or to select only cases without missing values for a particular variable:

airquality %>% 
  filter(!is.na(Solar.R)) %>% 
  head(3) %>% 
  pander

Ozone	Solar.R	Wind	Temp	Month	Day
41	190	7.4	67	5	1
36	118	8	72	5	2
12	149	12.6	74	5	3
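If dplyr is not loaded, the same selections can be sketched in base R by indexing rows with the logical vector from is.na(). The object names missing.solar and observed.solar are illustrative only:

```r
# Base R equivalents of the two dplyr filters above
# (airquality is a built-in dataset)
missing.solar  <- airquality[is.na(airquality$Solar.R), ]
observed.solar <- airquality[!is.na(airquality$Solar.R), ]

nrow(missing.solar)
# [1] 7
nrow(observed.solar)
# [1] 146
```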

Complete cases

Sometimes we want to select only rows which have no missing values — i.e. complete cases.

The complete.cases function accepts a dataframe (or matrix) and tests whether each row is complete. It returns a vector with a TRUE/FALSE result for each row:

complete.cases(airquality) %>% 
  head
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
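As a quick check, the result of complete.cases can be summed to count complete and incomplete rows, and the base R function na.omit drops the incomplete rows in one step. A small sketch, with illustrative object names:

```r
is.complete <- complete.cases(airquality)

sum(is.complete)    # number of complete rows
# [1] 111
sum(!is.complete)   # number of rows with at least one NA
# [1] 42

# na.omit() drops every row containing a missing value
complete.only <- na.omit(airquality)
nrow(complete.only)
# [1] 111
```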

This can also be useful in dplyr filters. Here we show all the rows which are not complete (note the exclamation mark):

airquality %>% 
  filter(!complete.cases(airquality))
   Ozone Solar.R Wind Temp Month Day
1     NA      NA 14.3   56     5   5
2     28      NA 14.9   66     5   6
3     NA     194  8.6   69     5  10
4      7      NA  6.9   74     5  11
5     NA      66 16.6   57     5  25
6     NA     266 14.9   58     5  26
7     NA      NA  8.0   57     5  27
8     NA     286  8.6   78     6   1
9     NA     287  9.7   74     6   2
10    NA     242 16.1   67     6   3
11    NA     186  9.2   84     6   4
12    NA     220  8.6   85     6   5
13    NA     264 14.3   79     6   6
14    NA     273  6.9   87     6   8
15    NA     259 10.9   93     6  11
16    NA     250  9.2   92     6  12
17    NA     332 13.8   80     6  14
18    NA     322 11.5   79     6  15
19    NA     150  6.3   77     6  21
20    NA      59  1.7   76     6  22
21    NA      91  4.6   76     6  23
22    NA     250  6.3   76     6  24
23    NA     135  8.0   75     6  25
24    NA     127  8.0   78     6  26
25    NA      47 10.3   73     6  27
26    NA      98 11.5   80     6  28
27    NA      31 14.9   77     6  29
28    NA     138  8.0   83     6  30
29    NA     101 10.9   84     7   4
30    NA     139  8.6   82     7  11
31    NA     291 14.9   91     7  14
32    NA     258  9.7   81     7  22
33    NA     295 11.5   82     7  23
34    78      NA  6.9   86     8   4
35    35      NA  7.4   85     8   5
36    66      NA  4.6   87     8   6
37    NA     222  8.6   92     8  10
38    NA     137 11.5   86     8  11
39    NA      64 11.5   79     8  15
40    NA     255 12.6   75     8  23
41    NA     153  5.7   88     8  27
42    NA     145 13.2   77     9  27

Sometimes it’s convenient to use the . (period) to represent the output from the previous pipe command. For example, we could rewrite the previous example as:

airquality %>% 
  filter(!complete.cases(.))  # note the . (period) here in place of `airquality`
   Ozone Solar.R Wind Temp Month Day
1     NA      NA 14.3   56     5   5
2     28      NA 14.9   66     5   6
3     NA     194  8.6   69     5  10
4      7      NA  6.9   74     5  11
5     NA      66 16.6   57     5  25
6     NA     266 14.9   58     5  26
7     NA      NA  8.0   57     5  27
8     NA     286  8.6   78     6   1
9     NA     287  9.7   74     6   2
10    NA     242 16.1   67     6   3
11    NA     186  9.2   84     6   4
12    NA     220  8.6   85     6   5
13    NA     264 14.3   79     6   6
14    NA     273  6.9   87     6   8
15    NA     259 10.9   93     6  11
16    NA     250  9.2   92     6  12
17    NA     332 13.8   80     6  14
18    NA     322 11.5   79     6  15
19    NA     150  6.3   77     6  21
20    NA      59  1.7   76     6  22
21    NA      91  4.6   76     6  23
22    NA     250  6.3   76     6  24
23    NA     135  8.0   75     6  25
24    NA     127  8.0   78     6  26
25    NA      47 10.3   73     6  27
26    NA      98 11.5   80     6  28
27    NA      31 14.9   77     6  29
28    NA     138  8.0   83     6  30
29    NA     101 10.9   84     7   4
30    NA     139  8.6   82     7  11
31    NA     291 14.9   91     7  14
32    NA     258  9.7   81     7  22
33    NA     295 11.5   82     7  23
34    78      NA  6.9   86     8   4
35    35      NA  7.4   85     8   5
36    66      NA  4.6   87     8   6
37    NA     222  8.6   92     8  10
38    NA     137 11.5   86     8  11
39    NA      64 11.5   79     8  15
40    NA     255 12.6   75     8  23
41    NA     153  5.7   88     8  27
42    NA     145 13.2   77     9  27

This is nice because we can apply the complete.cases function to the output of the previous pipe. For example, if we wanted to select the incomplete cases for a subset of the variables, we could write:

airquality %>% 
  select(Ozone, Solar.R) %>% 
  filter(!complete.cases(.))
   Ozone Solar.R
1     NA      NA
2     28      NA
3     NA     194
4      7      NA
5     NA      66
6     NA     266
7     NA      NA
8     NA     286
9     NA     287
10    NA     242
11    NA     186
12    NA     220
13    NA     264
14    NA     273
15    NA     259
16    NA     250
17    NA     332
18    NA     322
19    NA     150
20    NA      59
21    NA      91
22    NA     250
23    NA     135
24    NA     127
25    NA      47
26    NA      98
27    NA      31
28    NA     138
29    NA     101
30    NA     139
31    NA     291
32    NA     258
33    NA     295
34    78      NA
35    35      NA
36    66      NA
37    NA     222
38    NA     137
39    NA      64
40    NA     255
41    NA     153
42    NA     145

Or alternatively, to test only a subset of the variables but keep all of the columns in the result:

rows.to.keep <- !complete.cases(select(airquality, Ozone, Solar.R))
airquality %>% 
  filter(rows.to.keep) %>% 
  head(3) %>% 
  pander

Ozone	Solar.R	Wind	Temp	Month	Day
NA	NA	14.3	56	5	5
28	NA	14.9	66	5	6
NA	194	8.6	69	5	10
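Which columns are passed to complete.cases determines which rows count as complete. The sketch below uses base R column indexing instead of select, and the n.incomplete.* names are illustrative only:

```r
# All six columns
n.incomplete.all <- sum(!complete.cases(airquality))

# Only the two columns which contain missing values
n.incomplete.two <- sum(!complete.cases(airquality[, c("Ozone", "Solar.R")]))

# Solar.R plus two fully observed columns
n.incomplete.solar <- sum(!complete.cases(airquality[, c("Solar.R", "Wind", "Temp")]))

# Two columns with no missing values at all
n.incomplete.none <- sum(!complete.cases(airquality[, c("Wind", "Temp")]))

c(n.incomplete.all, n.incomplete.two, n.incomplete.solar, n.incomplete.none)
# [1] 42 42  7  0
```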

Missing data and R functions

It’s normally good practice to pre-process your data and select the rows you want to analyse before passing dataframes to R functions.

The reason for this is that different functions behave differently with missing data.

For example:

mean(airquality$Solar.R)
[1] NA

Here the default for mean() is to return NA if any of the values are missing. We can explicitly tell R to ignore missing values by setting na.rm=TRUE:

mean(airquality$Solar.R, na.rm=TRUE)
[1] 185.9315
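Several other base R summary functions, such as median() and sd(), behave in the same way. The sketch below also shows that na.rm=TRUE gives the same answer as removing the missing values by hand. The object names are illustrative:

```r
solar <- airquality$Solar.R

# These return NA by default when any value is missing
median(solar)
# [1] NA
sd(solar)
# [1] NA

# na.rm = TRUE drops the missing values before calculating
solar.mean <- mean(solar, na.rm = TRUE)

# ...which is equivalent to removing them ourselves first
solar.mean.manual <- mean(solar[!is.na(solar)])
all.equal(solar.mean, solar.mean.manual)
# [1] TRUE
```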

In contrast, some other functions, for example lm(), which runs a linear regression, will drop rows with missing values by default. If we run summary on the call to lm, we can see a line near the bottom of the output which reads: “(7 observations deleted due to missingness)”:

lm(Solar.R ~ Temp, data=airquality) %>% 
  summary

Call:
lm(formula = Solar.R ~ Temp, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max 
-169.697  -59.315    6.224   67.685  186.083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -24.431     61.508  -0.397 0.691809    
Temp           2.693      0.782   3.444 0.000752 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.86 on 144 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.07609,   Adjusted R-squared:  0.06967 
F-statistic: 11.86 on 1 and 144 DF,  p-value: 0.0007518

Normally R will do the ‘sensible thing’ when there are missing values, but it’s always worth checking whether you do have any missing data, and addressing this explicitly in your code.
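One way to make the handling explicit in a model is the na.action argument of lm(). The sketch below compares na.omit (the usual default, assuming options("na.action") has not been changed) with na.exclude, which fits the same model but pads residuals with NA so they line up with the original rows. The object names are illustrative:

```r
# Drop incomplete rows (stated explicitly rather than relying on the default)
fit.omit <- lm(Solar.R ~ Temp, data = airquality, na.action = na.omit)

# Same model, but residuals/fitted values are padded with NA for dropped rows
fit.exclude <- lm(Solar.R ~ Temp, data = airquality, na.action = na.exclude)

nobs(fit.omit)                  # rows actually used in the fit
# [1] 146
length(residuals(fit.omit))
# [1] 146
length(residuals(fit.exclude))  # one value per row of airquality
# [1] 153

# The estimated coefficients are the same either way
all.equal(coef(fit.omit), coef(fit.exclude))
# [1] TRUE
```

Using na.exclude can be convenient when residuals or predictions need to be added back to the original dataframe as a new column.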

Patterns of missingness

The mice package has some nice functions to describe patterns of missingness in the data. These can be useful at the exploratory stage, when you are checking and validating your data, and can also be used to create tables of missingness for publication:

mice::md.pattern(airquality)

    Wind Temp Month Day Solar.R Ozone   
111    1    1     1   1       1     1  0
35     1    1     1   1       1     0  1
5      1    1     1   1       0     1  1
2      1    1     1   1       0     0  2
       0    0     0   0       7    37 44

In this table, md.pattern lists the number of cases with particular patterns of missing data:

- Each row describes a missing data ‘pattern’
- The first column indicates the number of cases
- The central columns indicate whether a particular variable is missing for the pattern (0=missing)
- The last column counts the number of values missing for the pattern
- The final row counts the number of missing values for each variable
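The margins of this table can be cross-checked in base R without mice. This is only a rough sketch of the same counts, not a reimplementation of md.pattern, and the object names are illustrative:

```r
# Logical matrix: TRUE wherever a value is missing
missing <- is.na(airquality)

# Missing values per variable (compare with the final row of the table)
missing.per.variable <- colSums(missing)
missing.per.variable[["Ozone"]]
# [1] 37
missing.per.variable[["Solar.R"]]
# [1] 7

# Number of missing values in each row, tabulated:
# 111 rows have none missing, 40 have one, and 2 have two
missing.per.row <- table(rowSums(missing))

# Total number of missing values (bottom-right cell of the table)
sum(missing)
# [1] 44
```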

Visualising missingness

Graphics can also be useful to explore patterns in missingness.

fit.data contains data from an RCT of functional imagery training (FIT) for weight loss, which measured the outcome (weight in kg) at baseline and at two followups (kg1, kg2, kg3). The trial also measured global quality of life (gqol1).

As is common, there were some missing data at the followup:

fit.data <- readRDS("data/fit-weight.RDS") %>% 
  select(kg1, kg2, kg3, age, gqol1)

mice::md.pattern(fit.data)

    kg1 age gqol1 kg2 kg3   
112   1   1     1   1   1  0
2     1   1     1   1   0  1
7     1   1     1   0   0  2
8     0   0     0   0   0  5
      8   8     8  15  17 56

We might be interested to explore patterns in which observations were missing. Here we use colour to identify missing observations as a function of the data recorded at baseline:

fit.data %>% 
  mutate(missing.followup = is.na(kg2)) %>% 
  ggplot(aes(kg1, age, color=missing.followup)) +
  geom_point()

There’s a clear trend here for lighter patients (at baseline) to have more missing data at followup. There’s also a suggestion that younger patients are more likely to have been lost to followup.

If needed, we could perform inferential tests for these differences:

t.test(kg1 ~ is.na(kg2), data=fit.data)

    Welch Two Sample t-test

data:  kg1 by is.na(kg2)
t = 4.7153, df = 11.132, p-value = 0.000614
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.005116 19.236238
sample estimates:
mean in group FALSE  mean in group TRUE 
           90.59211            77.47143 

t.test(age ~ is.na(kg2), data=fit.data)

    Welch Two Sample t-test

data:  age by is.na(kg2)
t = 1.2418, df = 6.5246, p-value = 0.2571
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.39455 23.25169
sample estimates:
mean in group FALSE  mean in group TRUE 
           43.50000            35.57143

However, given the small number of missing values and the post-hoc nature of these analyses, these tests are rather underpowered, and we might prefer to report and comment on the plot alone.

For some nice missing data visualisation techniques, including those for repeated measures data, see @zhang2015missing.