“Table 1”

In reports of clinical trials and many psychological studies, Table 1 describes the characteristics of the sample. Typically, you will want to present information collected at baseline, split by experimental group, including:

  • Means, standard deviations or other descriptive statistics for continuous variables
  • Frequencies of particular responses for categorical variables
  • An inferential test of the null hypothesis of no difference between the groups; this could be a t-test, an F-test where there are more than 2 groups, or a chi-squared test for categorical variables.

Producing this table is a pain because it requires collating multiple statistics, calculated by different functions. Many researchers resort to running each analysis separately and then copying and pasting the results into Word.

It can be automated though! This example combines and extends many of the techniques we have learned using the split-apply-combine method.

To begin, let’s simulate some data from a fairly standard 2-arm clinical trial or psychological experiment:
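
The simulation code itself is not shown here. As a stand-in, the following is a minimal sketch that builds a data frame with the same columns and types as the glimpse() output in this section. The seed, distributions and parameters are all assumptions, so the counts and statistics it yields will not match the output shown later.

```r
# Hypothetical simulation: every distribution and parameter below is an
# assumption chosen to mimic the structure of the data, not the original code.
set.seed(1234)
n <- 280  # 140 participants per arm

boring.study <- data.frame(
  person    = seq_len(n),
  time      = rep(1L, n),  # baseline measurement only
  condition = factor(rep(c("Control", "Intervention"), each = n / 2)),
  yob       = round(rnorm(n, mean = 1980, sd = 5)),
  WM        = round(rnorm(n, mean = 100, sd = 10)),
  # NA is included so that some education responses are missing
  education = sample(c("Primary", "Secondary", "Graduate", "Postgraduate", NA),
                     n, replace = TRUE),
  ethnicity = sample(c("White British",
                       "Mixed / multiple ethnic groups",
                       "Asian / Asian British",
                       "Black / African / Caribbean / Black British"),
                     n, replace = TRUE),
  Attitude  = as.numeric(rpois(n, lambda = 7)),
  stringsAsFactors = FALSE
)
```

Because allocation is fixed with rep(), each arm has exactly 140 rows; only the measured variables are random.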

Check our data:

boring.study %>% glimpse
Observations: 280
Variables: 8
$ person    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ condition <fct> Control, Control, Control, Control, Control, Control, …
$ yob       <dbl> 1982, 1979, 1981, 1978, 1969, 1975, 1979, 1974, 1977, …
$ WM        <dbl> 102, 96, 100, 102, 101, 85, 94, 113, 107, 79, 114, 118…
$ education <chr> "Secondary", "Graduate", "Graduate", "Graduate", "Prim…
$ ethnicity <chr> "White British", "White British", "Black / African / C…
$ Attitude  <dbl> 6, 11, 6, 4, 1, 8, 7, 8, 7, 7, 11, 5, 6, 7, 7, 8, 10, …

Start by making a long-form table for the categorical variables:

boring.study.categorical.melted <-
  boring.study %>%
  select(condition, education, ethnicity) %>%
  melt(id.vars='condition')

Then calculate the N’s for each response/variable in each group:

(table1.categorical.Ns <-
  boring.study.categorical.melted %>%
  group_by(condition, variable, value) %>%
  summarise(N=n()) %>%
  dcast(variable+value~condition, value.var="N"))
   variable                                       value Control
1 education                                    Graduate      24
2 education                                Postgraduate      26
3 education                                     Primary      28
4 education                                   Secondary      31
5 education                                        <NA>      31
6 ethnicity                       Asian / Asian British      32
7 ethnicity Black / African / Caribbean / Black British      36
8 ethnicity              Mixed / multiple ethnic groups      40
9 ethnicity                               White British      32
  Intervention
1           34
2           28
3           31
4           23
5           24
6           28
7           36
8           41
9           35

Then make a second table containing chi-squared test statistics for each variable:

(table1.categorical.tests <-
  boring.study.categorical.melted %>%
  group_by(variable) %>%
  do(., chisq.test(.$value, .$condition) %>% tidy) %>%
  # this purely to facilitate matching rows up below
  mutate(firstrowforvar=TRUE))
# A tibble: 2 x 6
# Groups:   variable [2]
  variable  statistic p.value parameter method               firstrowforvar
  <fct>         <dbl>   <dbl>     <int> <chr>                <lgl>         
1 education     2.92    0.404         3 Pearson's Chi-squar… TRUE          
2 ethnicity     0.413   0.937         3 Pearson's Chi-squar… TRUE          
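
Note that the table of counts includes a row for missing education responses, yet the test for education reports 3 degrees of freedom. That is consistent with chisq.test() dropping incomplete cases when it is given two vectors, so that the test is based on the 4 observed levels only. A minimal sketch, using a made-up `toy` data frame (hypothetical, not part of the study data), illustrates the behaviour:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  condition = rep(c("Control", "Intervention"), each = 10),
  education = c("Primary", "Primary", "Primary", "Secondary", "Secondary",
                "Graduate", "Graduate", "Graduate", NA, NA,
                "Primary", "Secondary", "Secondary", "Secondary", "Graduate",
                "Graduate", "Graduate", "Graduate", "Graduate", NA),
  stringsAsFactors = FALSE
)

# table() drops NA by default, matching what chisq.test() does
# when it is handed two vectors
observed <- table(toy$education, toy$condition)

# suppressWarnings() hides the small-expected-count warning that this
# tiny toy table triggers
test_from_vectors <- suppressWarnings(chisq.test(toy$education, toy$condition))
test_from_table   <- suppressWarnings(chisq.test(observed))

# df = (rows - 1) * (columns - 1), counting only the non-missing levels
unname(test_from_vectors$parameter)   # 2

# to see the missing responses as their own row, ask table() for them
table(toy$education, toy$condition, useNA = "ifany")
```

If missingness should count as a response category in the test, one option is to recode NA to an explicit label before testing; whether that is appropriate depends on the study.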

Combine these together:

(table1.categorical.both <- table1.categorical.Ns %>%
  group_by(variable) %>%
  # we join on firstrowforvar to make sure we don't duplicate the tests
  mutate(firstrowforvar=row_number()==1) %>%
  left_join(., table1.categorical.tests, by=c("variable", "firstrowforvar")) %>%
  # this is gross, but we don't want to repeat the variable names in our table
  ungroup() %>%
  mutate(variable = ifelse(firstrowforvar, as.character(variable), NA)) %>%
  select(variable, value, Control, Intervention, statistic, parameter, p.value))
# A tibble: 9 x 7
  variable  value          Control Intervention statistic parameter p.value
  <chr>     <chr>            <int>        <int>     <dbl>     <int>   <dbl>
1 education Graduate            24           34     2.92          3   0.404
2 <NA>      Postgraduate        26           28    NA            NA  NA    
3 <NA>      Primary             28           31    NA            NA  NA    
4 <NA>      Secondary           31           23    NA            NA  NA    
5 <NA>      <NA>                31           24    NA            NA  NA    
6 ethnicity Asian / Asian…      32           28     0.413         3   0.937
7 <NA>      Black / Afric…      36           36    NA            NA  NA    
8 <NA>      Mixed / multi…      40           41    NA            NA  NA    
9 <NA>      White British       32           35    NA            NA  NA    

Now we deal with the continuous variables. First we make a ‘long’ version of the continuous data:

continuous_variables <- c("yob", "WM")
boring.continuous.melted <-
  boring.study %>%
  select(condition, continuous_variables) %>%
  melt() %>%
  group_by(variable)
Using condition as id variables

boring.continuous.melted %>% head
# A tibble: 6 x 3
# Groups:   variable [1]
  condition variable value
  <fct>     <fct>    <dbl>
1 Control   yob       1982
2 Control   yob       1979
3 Control   yob       1981
4 Control   yob       1978
5 Control   yob       1969
6 Control   yob       1975

Then calculate separate tables of t-tests and means/SD’s:

(table.continuous_variables.tests <-
    boring.continuous.melted %>%
    # note that we pass the result of t-test to tidy, which returns a dataframe
    do(., t.test(.$value~.$condition) %>% tidy) %>%
    select(variable, statistic, parameter, p.value))
# A tibble: 2 x 4
# Groups:   variable [2]
  variable statistic parameter p.value
  <fct>        <dbl>     <dbl>   <dbl>
1 yob         -1.07       269.   0.285
2 WM          -0.455      276.   0.649
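
The fractional degrees of freedom (269 and 276 rather than 278) arise because t.test() defaults to Welch's unequal-variance test (var.equal = FALSE). The sketch below contrasts the two versions on made-up scores; the `toy` data and its parameters are assumptions for illustration only.

```r
set.seed(42)
# Hypothetical scores for two groups of 140 (made-up means and SDs)
toy <- data.frame(
  condition = rep(c("Control", "Intervention"), each = 140),
  score     = c(rnorm(140, mean = 100, sd = 10),
                rnorm(140, mean = 100, sd = 12))
)

# default: Welch test, which does not assume equal variances
welch   <- t.test(score ~ condition, data = toy)
# classic Student test, which pools the variances
student <- t.test(score ~ condition, data = toy, var.equal = TRUE)

welch$parameter     # fractional df (Welch-Satterthwaite approximation)
student$parameter   # n1 + n2 - 2 = 278
```

Whichever version is used, it is worth naming it in the table caption so readers can interpret the df column.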

(table.continuous_variables.descriptives <-
    boring.continuous.melted %>%
    group_by(variable, condition) %>%
    # this is not needed here because we have no missing values, but if there
    # were missing values in this dataset then mean() and sd() would return NA
    # below, so it is best to remove rows without a response:
    filter(!is.na(value)) %>%
    # note, we might also want the median/IQR
    summarise(Mean=mean(value), SD=sd(value)) %>%
    group_by(variable, condition) %>%
    # we format the mean and SD into a single column using sprintf.
    # we don't have to do this, but it makes reshaping simpler and we probably want
    # to round the numbers at some point, and so may as well do this now.
    transmute(MSD = sprintf("%.2f (%.2f)", Mean, SD)) %>%
    dcast(variable~condition))
Using MSD as value column: use value.var to override.
  variable        Control   Intervention
1      yob 1979.31 (5.58) 1979.97 (4.63)
2       WM  99.37 (10.00)  99.94 (10.99)
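
For skewed variables the median and interquartile range may be a better summary than the mean and SD. The following is a minimal sketch of a formatting helper in the same sprintf style; `median_iqr` is a hypothetical name, and the bracketed layout is just one possible convention.

```r
# Hypothetical helper: formats "median [Q1, Q3]" as a single string.
# na.rm = TRUE means missing values are ignored rather than propagated.
median_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  sprintf("%.2f [%.2f, %.2f]", q[2], q[1], q[3])
}

x <- c(1, 2, 3, 4, 5, NA)

median_iqr(x)            # "3.00 [2.00, 4.00]"
mean(x)                  # NA, because of the missing value
mean(x, na.rm = TRUE)    # 3
```

Inside the pipeline this could stand in for the Mean/SD step, for example summarise(MIQR = median_iqr(value)), with the later column names adjusted to match.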

And combine them:

(table.continuous_variables.both <-
  left_join(table.continuous_variables.descriptives,
            table.continuous_variables.tests))
Joining, by = "variable"
  variable        Control   Intervention  statistic parameter   p.value
1      yob 1979.31 (5.58) 1979.97 (4.63) -1.0714551  268.8780 0.2849256
2       WM  99.37 (10.00)  99.94 (10.99) -0.4549637  275.5261 0.6494937

Finally put the whole thing together:

(table1 <- table1.categorical.both %>%
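  # make these variables into character format to be consistent with
  # the Mean (SD) column for continuous variables
  # (funs() is deprecated, hence the warning below;
  # mutate_at(vars(Control, Intervention), format) avoids it)
  mutate_each(funs(format), Control, Intervention) %>%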
  # note the '.' as the first argument, which is the input from the pipe
  bind_rows(.,
          table.continuous_variables.both) %>%
  # prettify a few things
  rename(df = parameter,
         p=p.value,
         `Control N/Mean (SD)`= Control,
         Variable=variable,
         Response=value,
         `t/χ2` = statistic))
Warning: funs() is soft deprecated as of dplyr 0.8.0
please use list() instead

  # Before:
  funs(name = f(.))

  # After: 
  list(name = ~ f(.))
This warning is displayed once per session.
Warning in bind_rows_(x, .id): binding character and factor vector,
coercing into character vector
# A tibble: 11 x 7
   Variable  Response     `Control N/Mean… Intervention `t/χ2`    df      p
   <chr>     <chr>        <chr>            <chr>         <dbl> <dbl>  <dbl>
 1 education Graduate     24               34            2.92     3   0.404
 2 <NA>      Postgraduate 26               28           NA       NA  NA    
 3 <NA>      Primary      28               31           NA       NA  NA    
 4 <NA>      Secondary    31               23           NA       NA  NA    
 5 <NA>      <NA>         31               24           NA       NA  NA    
 6 ethnicity Asian / Asi… 32               28            0.413    3   0.937
 7 <NA>      Black / Afr… 36               36           NA       NA  NA    
 8 <NA>      Mixed / mul… 40               41           NA       NA  NA    
 9 <NA>      White Briti… 32               35           NA       NA  NA    
10 yob       <NA>         1979.31 (5.58)   1979.97 (4.… -1.07   269.  0.285
11 WM        <NA>         99.37 (10.00)    99.94 (10.9… -0.455  276.  0.649

Finally, we can print the table in markdown format. This is best done in a separate chunk to avoid warnings/messages appearing in the final document.

table1 %>%
  # split.tables argument needed to avoid the table wrapping
  pander(split.tables=Inf,
         missing="-",
         justify=c("left", "left", rep("center", 5)),
         caption='Table presenting baseline differences between conditions. Categorical variables tested with Pearson χ2, continuous variables with two-sample t-test.')
Table presenting baseline differences between conditions. Categorical variables tested with Pearson χ2, continuous variables with two-sample t-test.
Variable   Response                                     Control N/Mean (SD)  Intervention    t/χ2    df     p
---------  -------------------------------------------  -------------------  --------------  ------  -----  ------
education  Graduate                                     24                   34              2.921   3      0.404
-          Postgraduate                                 26                   28              -       -      -
-          Primary                                      28                   31              -       -      -
-          Secondary                                    31                   23              -       -      -
-          -                                            31                   24              -       -      -
ethnicity  Asian / Asian British                        32                   28              0.4133  3      0.9375
-          Black / African / Caribbean / Black British  36                   36              -       -      -
-          Mixed / multiple ethnic groups               40                   41              -       -      -
-          White British                                32                   35              -       -      -
yob        -                                            1979.31 (5.58)       1979.97 (4.63)  -1.071  268.9  0.2849
WM         -                                            99.37 (10.00)        99.94 (10.99)   -0.455  275.5  0.6495

Some exercises to work on, and extensions to this code you might need:

  • Add a new continuous variable to the simulated dataset and include it in the final table
  • Create a third experimental group and amend the code to i) include 3 columns for the N/Mean (SD) and ii) report the F-test from a one-way ANOVA as the test statistic.
  • Add the within-group percentage for each response to a categorical variable.
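
As a possible starting point for the last two exercises, here is a sketch in base R on made-up three-arm data. The `toy` data frame, its group labels and the "N (%)" format are all assumptions; integrating these pieces into the pipeline above is left as the exercise.

```r
# Hypothetical three-arm toy data (all values made up)
set.seed(1)
toy <- data.frame(
  condition = factor(rep(c("Control", "Intervention A", "Intervention B"),
                         each = 30)),
  education = sample(c("Primary", "Secondary", "Graduate"), 90, replace = TRUE),
  WM        = rnorm(90, mean = 100, sd = 10)
)

# Within-group percentages: margin = 2 makes each condition column sum to 100
counts <- table(toy$education, toy$condition)
pct    <- 100 * prop.table(counts, margin = 2)
n_pct  <- matrix(sprintf("%d (%.1f%%)", as.vector(counts), as.vector(pct)),
                 nrow = nrow(counts), dimnames = dimnames(counts))

# One-way ANOVA F-test across the three groups
fit    <- summary(aov(WM ~ condition, data = toy))[[1]]
f_stat <- fit[["F value"]][1]
f_df   <- fit[["Df"]]          # between-group and residual df: 2 and 87
p_val  <- fit[["Pr(>F)"]][1]
```

Note that the F-test has two degrees of freedom, so the single df column in the table would need to hold both (for example as "2, 87").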