Missing values

Missing values aren’t a data type as such, but are an important concept in R; the way different functions handle missing values can be both helpful and frustrating in equal measure.

Missing values in a vector are denoted by the letters NA, but notice that these letters are unquoted. That is to say NA is not the same as "NA"!
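A quick way to see the difference is to compare the two directly. This is a minimal sketch using only base R; the names real.missing and fake.missing are made up for illustration.

```r
# NA is a special marker for a missing value; "NA" is an ordinary two-letter string
is.na(NA)      # TRUE
is.na("NA")    # FALSE

# NA takes on the type of the vector it sits in
real.missing <- c(1, 2, NA)
class(real.missing)        # "numeric"

# Quoting NA turns the whole vector into text, and nothing counts as missing
fake.missing <- c(1, 2, "NA")
class(fake.missing)        # "character"
sum(is.na(fake.missing))   # 0
```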

To check for missing values in a vector (or dataframe column) we use the is.na() function:

nums.with.missing <- c(1, 2, NA)
nums.with.missing
[1]  1  2 NA

is.na(nums.with.missing)
[1] FALSE FALSE  TRUE


Here the is.na() function has tested whether each item in our vector called nums.with.missing is missing. It returns a new vector with the results of each test: either TRUE or FALSE.

We can also use the negation operator, the ! symbol, to reverse the meaning of is.na(). So we can read !is.na(nums.with.missing) as “test whether the values in nums.with.missing are NOT missing”:

# test if missing
is.na(nums.with.missing)
[1] FALSE FALSE  TRUE

# test if NOT missing (note the exclamation mark in front of the function)
!is.na(nums.with.missing)
[1]  TRUE  TRUE FALSE

We can use the is.na() function as part of dplyr filters:

library(dplyr)  # provides %>% and filter()

airquality %>% 
  filter(is.na(Solar.R)) %>% 
  head(3)
Ozone Solar.R Wind Temp Month Day
   NA      NA 14.3   56     5   5
   28      NA 14.9   66     5   6
    7      NA  6.9   74     5  11

Or to select only cases without missing values for a particular variable:

airquality %>% 
  filter(!is.na(Solar.R)) %>% 
  head(3)
Ozone Solar.R Wind Temp Month Day
   41     190  7.4   67     5   1
   36     118  8.0   72     5   2
   12     149 12.6   74     5   3
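The same logical test works without dplyr. As a sketch, base R's subset() or square-bracket indexing selects the same rows; the names solar.present and solar.present.2 are chosen here purely for illustration.

```r
# Keep rows where Solar.R is not missing, using base R only
solar.present <- subset(airquality, !is.na(Solar.R))

# Equivalent logical indexing with square brackets
solar.present.2 <- airquality[!is.na(airquality$Solar.R), ]

nrow(airquality)        # 153 rows in total
nrow(solar.present)     # 146 rows remain once missing Solar.R is dropped
nrow(solar.present.2)   # 146 again
```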

Complete cases

Sometimes we want to select only rows which have no missing values — i.e. complete cases.

The complete.cases function accepts a dataframe (or matrix) and tests whether each row is complete. It returns a vector with a TRUE/FALSE result for each row:

complete.cases(airquality)
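Because the result is a plain logical vector, it can be summarised directly. The sketch below uses only base R and stores the result in is.complete, a name introduced here for illustration.

```r
# complete.cases() returns one TRUE/FALSE value per row
is.complete <- complete.cases(airquality)

head(is.complete)     # TRUE TRUE TRUE TRUE FALSE FALSE
length(is.complete)   # 153, one value per row
sum(is.complete)      # 111 complete rows
sum(!is.complete)     # 42 rows with at least one missing value
```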

This can also be useful in dplyr filters. Here we show all the rows which are not complete (note the exclamation mark):

airquality %>% 
  filter(!complete.cases(airquality))

Sometimes it’s convenient to use the . (period) to represent the output from the previous pipe command. For example, we could rewrite the previous example as:

airquality %>% 
  filter(!complete.cases(.))  # note the . (period) here in place of `airquality`

This is nice because we can apply the complete.cases function to the output of the previous pipe. For example, if we wanted to select complete cases for a subset of the variables we could write:

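airquality %>% 
  select(Ozone, Solar.R) %>% 
  filter(complete.cases(.))

Alternatively, we can compute the logical vector first and then use it to filter the full dataframe, which keeps all of the columns. Here we negate the result with ! to show the rows which are incomplete for these two variables:

rows.to.keep <- !complete.cases(select(airquality, Ozone, Solar.R))
airquality %>% 
  filter(rows.to.keep) %>% 
  head(3)
Ozone Solar.R Wind Temp Month Day
   NA      NA 14.3   56     5   5
   28      NA 14.9   66     5   6
   NA     194  8.6   69     5  10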

Missing data and R functions

It’s normally good practice to pre-process your data and select the rows you want to analyse before passing dataframes to R functions.

The reason for this is that different functions behave differently with missing data.

For example:

mean(airquality$Solar.R)
[1] NA

Here the default for mean() is to return NA if any of the values are missing. We can explicitly tell R to ignore missing values by setting na.rm=TRUE:

mean(airquality$Solar.R, na.rm=TRUE)
[1] 185.9315

In contrast, some other functions drop missing values by default. One example is lm(), which runs a linear regression and silently excludes any row with a missing value in the variables used. If we run summary() on the result of lm() we can see a line near the bottom of the output which reads: “(7 observations deleted due to missingness)”

lm(Solar.R ~ Temp, data=airquality) %>% 
  summary()

Call:
lm(formula = Solar.R ~ Temp, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max 
-169.697  -59.315    6.224   67.685  186.083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -24.431     61.508  -0.397 0.691809    
Temp           2.693      0.782   3.444 0.000752 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.86 on 144 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.07609,   Adjusted R-squared:  0.06967 
F-statistic: 11.86 on 1 and 144 DF,  p-value: 0.0007518
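If you would rather be told about missing data than have rows dropped quietly, lm() accepts an na.action argument. The following is a minimal sketch, assuming R's default global setting of na.omit; the object names solar.model and strict.fit are just for illustration.

```r
# Default behaviour: rows with a missing Solar.R or Temp are dropped before fitting
solar.model <- lm(Solar.R ~ Temp, data = airquality)
nobs(solar.model)                # 146 observations used, out of 153 rows
length(na.action(solar.model))   # 7 rows were dropped

# na.fail makes lm() stop with an error instead of dropping rows;
# try() catches that error here so the script can carry on
strict.fit <- try(
  lm(Solar.R ~ Temp, data = airquality, na.action = na.fail),
  silent = TRUE
)
inherits(strict.fit, "try-error")   # TRUE
```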

Normally R will do the ‘sensible thing’ when there are missing values, but it’s always worth checking whether you do have any missing data, and addressing this explicitly in your code.
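One simple check, sketched here with base R only, is to count the missing values in each column before running any analysis:

```r
# Is there any missing data at all?
anyNA(airquality)             # TRUE

# Number of missing values in each column
colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0

# Total number of missing values in the dataframe
sum(is.na(airquality))        # 44
```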

Patterns of missingness

The mice package has some nice functions to describe patterns of missingness in the data. These can be useful at the exploratory stage, when you are checking and validating your data, and can also be used to create tables of missingness for publication:

mice::md.pattern(airquality)
    Wind Temp Month Day Solar.R Ozone   
111    1    1     1   1       1     1  0
35     1    1     1   1       1     0  1
5      1    1     1   1       0     1  1
2      1    1     1   1       0     0  2
       0    0     0   0       7    37 44

In this table, md.pattern lists the number of cases with particular patterns of missing data.

- Each row describes a missing data ‘pattern’
- The first column indicates the number of cases
- The central columns indicate whether a particular variable is missing for the pattern (0=missing, 1=observed)
- The last column counts the number of variables missing for the pattern
- The final row counts the number of missing values for each variable.
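To see where the case counts in the airquality table come from, the two variables that contain missing values can be cross-tabulated by hand. This is a base R sketch rather than anything mice does internally; pattern.counts and the dimension labels are names invented for this example.

```r
# Cross-tabulate missingness in the two variables that contain any NAs
pattern.counts <- table(
  solar.missing = is.na(airquality$Solar.R),
  ozone.missing = is.na(airquality$Ozone)
)
pattern.counts
#              ozone.missing
# solar.missing FALSE TRUE
#         FALSE   111   35
#         TRUE      5    2
```

The four cells match the four pattern rows: 111 complete cases, 35 missing only Ozone, 5 missing only Solar.R, and 2 missing both.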

Visualising missingness

Graphics can also be useful to explore patterns in missingness.

fit.data contains data from an RCT of functional imagery training (FIT) for weight loss, which measured the outcome (weight in kg) at baseline and at two followups (kg1, kg2, kg3). The trial also measured global quality of life (gqol).

As is common, there were some missing data at the followup:

fit.data <- readRDS("data/fit-weight.RDS") %>% 
  select(kg1, kg2, kg3, age, gqol1)

mice::md.pattern(fit.data)
    kg1 age gqol1 kg2 kg3   
112   1   1     1   1   1  0
2     1   1     1   1   0  1
7     1   1     1   0   0  2
8     0   0     0   0   0  5
      8   8     8  15  17 56

We might be interested to explore patterns in which observations were missing. Here we use colour to identify missing observations as a function of the data recorded at baseline:

library(ggplot2)

fit.data %>% 
  mutate(missing.followup = is.na(kg2)) %>% 
  ggplot(aes(kg1, age, color=missing.followup)) +
  geom_point()  # scatterplot of baseline weight against age

There’s a clear trend here for lighter patients (at baseline) to have more missing data at followup. There’s also a suggestion that younger patients are more likely to have been lost to followup.

If needed, we could perform inferential tests for these differences:

t.test(kg1 ~ is.na(kg2), data=fit.data)

    Welch Two Sample t-test

data:  kg1 by is.na(kg2)
t = 4.7153, df = 11.132, p-value = 0.000614
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.005116 19.236238
sample estimates:
mean in group FALSE  mean in group TRUE 
           90.59211            77.47143 

t.test(age ~ is.na(kg2), data=fit.data)

    Welch Two Sample t-test

data:  age by is.na(kg2)
t = 1.2418, df = 6.5246, p-value = 0.2571
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.39455 23.25169
sample estimates:
mean in group FALSE  mean in group TRUE 
           43.50000            35.57143 

However, given the small number of missing values and the post-hoc nature of these analyses, these tests are rather underpowered, and we might prefer to report and comment on the plot alone.

For some nice missing data visualisation techniques, including those for repeated measures data, see @zhang2015missing.