## “Table 1”

In reports of clinical trials and many psychological studies, Table 1 describes the characteristics of the sample. Typically, you will want to present information collected at baseline, split by experimental group, including:

• Means, standard deviations or other descriptive statistics for continuous variables
• Frequencies of particular responses for categorical variables
• Some kind of inferential test for a zero-difference between the groups; this could be a t-test, an F-statistic where there are more than 2 groups, or a chi-squared test for categorical variables.
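
As a rough illustration of the three kinds of test listed above, here is a minimal sketch in base R. The `toy` data frame and its columns are invented for this example and are not the dataset used in the rest of this section.

```r
# Toy data, invented for illustration only (not the tutorial's dataset)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("Control", "Intervention"), each = 50)),
  score = rnorm(100, mean = 100, sd = 10),
  sex   = factor(sample(c("Female", "Male"), 100, replace = TRUE))
)

# Two groups, continuous outcome: t-test (R uses Welch's version by default)
t_result <- t.test(score ~ group, data = toy)

# One-way ANOVA: gives an F-statistic and also works with more than 2 groups
f_result <- summary(aov(score ~ group, data = toy))

# Categorical outcome: chi-squared test on the group-by-response table
chi_result <- chisq.test(table(toy$group, toy$sex))

# Pull out the p-values
t_result$p.value
f_result[[1]][["Pr(>F)"]][1]
chi_result$p.value
```

Each function returns a differently shaped object, which is why collating the results by hand is tedious.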

Producing this table is a pain because it requires collating multiple statistics calculated by different functions. Many researchers resort to running each analysis separately and then copying and pasting the results into Word.

It can be automated though! This example combines and extends many of the techniques we have learned using the split-apply-combine method.

To begin, let’s simulate some data from a fairly standard 2-arm clinical trial or psychological experiment:
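
The simulation code itself is not reproduced here. A minimal sketch that produces a data frame with the same column names and types as the one inspected below might look like this. The seed, distributions, one-row-per-person layout and category probabilities are all assumptions, so the numbers will not match the output shown in the rest of this section.

```r
library(dplyr)

set.seed(1234)  # arbitrary seed; results will differ from the output shown in the text
n <- 280

boring.study <- tibble(
  person    = 1:n,
  time      = 1L,  # assumption: a single baseline measurement per person
  condition = factor(rep(c("Control", "Intervention"), each = n / 2)),
  yob       = round(rnorm(n, mean = 1980, sd = 5)),
  WM        = round(rnorm(n, mean = 100, sd = 10)),
  # NA is included so that some education responses are missing
  education = sample(c("Primary", "Secondary", "Graduate", "Postgraduate", NA),
                     n, replace = TRUE),
  ethnicity = sample(c("White British",
                       "Mixed / multiple ethnic groups",
                       "Asian / Asian British",
                       "Black / African / Caribbean / Black British"),
                     n, replace = TRUE),
  Attitude  = as.numeric(rpois(n, lambda = 7))
)
```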

Check our data:

boring.study %>% glimpse
Observations: 280
Variables: 8
$ person    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ condition <fct> Control, Control, Control, Control, Control, Control, …
$ yob       <dbl> 1982, 1979, 1981, 1978, 1969, 1975, 1979, 1974, 1977, …
$ WM        <dbl> 102, 96, 100, 102, 101, 85, 94, 113, 107, 79, 114, 118…
$ education <chr> "Secondary", "Graduate", "Graduate", "Graduate", "Prim…
$ ethnicity <chr> "White British", "White British", "Black / African / C…
$ Attitude  <dbl> 6, 11, 6, 4, 1, 8, 7, 8, 7, 7, 11, 5, 6, 7, 7, 8, 10, …

Start by making a long-form table for the categorical variables:

# melt() comes from the reshape2 package
boring.study.categorical.melted <- boring.study %>%
  select(condition, education, ethnicity) %>%
  melt(id.vars = 'condition')

Then calculate the Ns for each response/variable in each group:

(table1.categorical.Ns <-
boring.study.categorical.melted %>%
group_by(condition, variable, value) %>%
summarise(N=n()) %>%
dcast(variable+value~condition, value.var="N"))
   variable                                       value Control Intervention
1 education                                    Graduate      24           34
2 education                                Postgraduate      26           28
3 education                                     Primary      28           31
4 education                                   Secondary      31           23
5 education                                        <NA>      31           24
6 ethnicity                       Asian / Asian British      32           28
7 ethnicity Black / African / Caribbean / Black British      36           36
8 ethnicity              Mixed / multiple ethnic groups      40           41
9 ethnicity                               White British      32           35

Then make a second table containing χ2 test statistics for each variable:

(table1.categorical.tests <-
boring.study.categorical.melted %>%
group_by(variable) %>%
do(., chisq.test(.$value, .$condition) %>% tidy) %>%
# this is purely to facilitate matching rows up below
mutate(firstrowforvar = TRUE))
# A tibble: 2 x 6
# Groups:   variable [2]
variable  statistic p.value parameter method               firstrowforvar
<fct>         <dbl>   <dbl>     <int> <chr>                <lgl>
1 education     2.92    0.404         3 Pearson's Chi-squar… TRUE
2 ethnicity     0.413   0.937         3 Pearson's Chi-squar… TRUE          
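
To see what `tidy` contributes in the pipeline above, it can help to run a single test outside `do()`. This is a minimal sketch using invented toy data; it assumes the broom package is installed.

```r
library(broom)

# Toy data, invented for illustration only
set.seed(2)
toy <- data.frame(
  condition = rep(c("Control", "Intervention"), each = 100),
  response  = sample(c("Agree", "Neutral", "Disagree"), 200, replace = TRUE)
)

# chisq.test() returns an "htest" object: a list designed for printing
raw_test <- chisq.test(toy$response, toy$condition)
class(raw_test)

# tidy() flattens it into a one-row data frame with columns including
# statistic, p.value, parameter (the degrees of freedom) and method
tidy_test <- tidy(raw_test)
tidy_test
```

Because each test becomes a single row, `do()` can run the test once per group and stack the rows into one table.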

Combine these together:

(table1.categorical.both <- table1.categorical.Ns %>%
group_by(variable) %>%
# we join on firstrowforvar to make sure we don't duplicate the tests
mutate(firstrowforvar=row_number()==1) %>%
left_join(., table1.categorical.tests, by=c("variable", "firstrowforvar")) %>%
# this is gross, but we don't want to repeat the variable names in our table
ungroup() %>%
mutate(variable = ifelse(firstrowforvar==T, as.character(variable), NA)) %>%
select(variable, value, Control, Intervention, statistic, parameter, p.value))
# A tibble: 9 x 7
variable  value          Control Intervention statistic parameter p.value
<chr>     <chr>            <int>        <int>     <dbl>     <int>   <dbl>
1 education Graduate            24           34     2.92          3   0.404
2 <NA>      Postgraduate        26           28    NA            NA  NA
3 <NA>      Primary             28           31    NA            NA  NA
4 <NA>      Secondary           31           23    NA            NA  NA
5 <NA>      <NA>                31           24    NA            NA  NA
6 ethnicity Asian / Asian…      32           28     0.413         3   0.937
7 <NA>      Black / Afric…      36           36    NA            NA  NA
8 <NA>      Mixed / multi…      40           41    NA            NA  NA
9 <NA>      White British       32           35    NA            NA  NA    
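
The `firstrowforvar` trick is easier to follow on a tiny example. In this sketch the `counts` and `tests` tables are invented. Joining on both `variable` and the flag attaches each test result to the first row of its variable only.

```r
library(dplyr)

# Toy inputs, invented for illustration only
counts <- tibble(
  variable = c("a", "a", "a", "b", "b"),
  value    = c("x", "y", "z", "p", "q"),
  N        = c(10L, 12L, 8L, 15L, 15L)
)
tests <- tibble(
  variable       = c("a", "b"),
  statistic      = c(2.9, 0.4),
  firstrowforvar = TRUE
)

joined <- counts %>%
  group_by(variable) %>%
  # TRUE for the first row within each variable, FALSE otherwise
  mutate(firstrowforvar = row_number() == 1) %>%
  ungroup() %>%
  # only rows flagged TRUE find a match in `tests`; the rest get NA
  left_join(tests, by = c("variable", "firstrowforvar"))

joined$statistic
# 2.9  NA  NA 0.4  NA
```

Joining on `variable` alone would instead repeat the statistic on every row of that variable.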

Now we deal with the continuous variables. First we make a ‘long’ version of the continuous data:

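continuous_variables <- c("yob", "WM")
boring.continuous.melted <-
  boring.study %>%
  # all_of() makes it explicit that we are selecting by a character vector
  select(condition, all_of(continuous_variables)) %>%
  melt() %>%
  group_by(variable)
Using condition as id variables

boring.continuous.melted %>% head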
# A tibble: 6 x 3
# Groups:   variable [1]
condition variable value
<fct>     <fct>    <dbl>
1 Control   yob       1982
2 Control   yob       1979
3 Control   yob       1981
4 Control   yob       1978
5 Control   yob       1969
6 Control   yob       1975

Then calculate separate tables of t-tests and means/SDs:

(table.continuous_variables.tests <-
boring.continuous.melted %>%
# note that we pass the result of t-test to tidy, which returns a dataframe
do(., t.test(.$value~.$condition) %>% tidy) %>%
select(variable, statistic, parameter, p.value))
# A tibble: 2 x 4
# Groups:   variable [2]
variable statistic parameter p.value
<fct>        <dbl>     <dbl>   <dbl>
1 yob         -1.07       269.   0.285
2 WM          -0.455      276.   0.649
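
The fractional degrees of freedom in this output arise because `t.test` runs Welch's t-test by default (`var.equal = FALSE`), which does not assume equal variances in the two groups. The sketch below, on invented toy data, contrasts this with the pooled-variance Student's t-test in case that is the version you need to report.

```r
# Toy data, invented for illustration only
set.seed(42)
toy <- data.frame(
  condition = rep(c("Control", "Intervention"), each = 30),
  value     = c(rnorm(30, mean = 100, sd = 10), rnorm(30, mean = 100, sd = 15))
)

welch   <- t.test(value ~ condition, data = toy)                    # default: var.equal = FALSE
student <- t.test(value ~ condition, data = toy, var.equal = TRUE)  # pooled variance

welch$parameter    # fractional df from the Welch-Satterthwaite approximation
student$parameter  # n1 + n2 - 2 = 58
```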

(table.continuous_variables.descriptives <-
boring.continuous.melted %>%
group_by(variable, condition) %>%
# this is not strictly needed here because we have no missing values, but if
# there were missing values in this dataset then mean() and sd() would return
# NA below, so it is best to remove rows without a response:
filter(!is.na(value)) %>%
# note, we might also want the median/IQR
summarise(Mean=mean(value), SD=sd(value)) %>%
group_by(variable, condition) %>%
# we format the mean and SD into a single column using sprintf.
# we don't have to do this, but it makes reshaping simpler and we probably want
# to round the numbers at some point, and so may as well do this now.
transmute(MSD = sprintf("%.2f (%.2f)", Mean, SD)) %>%
dcast(variable~condition))
Using MSD as value column: use value.var to override.
variable        Control   Intervention
1      yob 1979.31 (5.58) 1979.97 (4.63)
2       WM  99.37 (10.00)  99.94 (10.99)
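
As a small illustration of the `sprintf` formatting step, and of how the median/IQR mentioned in the code comment could be formatted in the same way, here is a sketch using made-up numbers:

```r
# Made-up values, for illustration only
m <- 99.3714
s <- 10.0042

# "%.2f" rounds to two decimal places and keeps trailing zeros
msd <- sprintf("%.2f (%.2f)", m, s)
msd
# "99.37 (10.00)"

# The same idea for a median and interquartile range
x <- c(3, 5, 7, 9, 11, 40)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
med_iqr <- sprintf("%.1f [%.1f, %.1f]", q[2], q[1], q[3])
med_iqr
```

Formatting to a single character column early means each group needs only one column when the table is reshaped to wide format.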

And combine them:

(table.continuous_variables.both <-
left_join(table.continuous_variables.descriptives,
table.continuous_variables.tests))
Joining, by = "variable"
variable        Control   Intervention  statistic parameter   p.value
1      yob 1979.31 (5.58) 1979.97 (4.63) -1.0714551  268.8780 0.2849256
2       WM  99.37 (10.00)  99.94 (10.99) -0.4549637  275.5261 0.6494937

Finally put the whole thing together:

(table1 <- table1.categorical.both %>%
  # make these variables into character format to be consistent with
  # the Mean (SD) column for continuous variables.
  # mutate_each() and funs() are deprecated; in current dplyr the equivalent
  # is mutate(across(c(Control, Intervention), format))
  mutate_each(funs(format), Control, Intervention) %>%
  # note the '.' as the first argument, which is the input from the pipe
  bind_rows(.,
            table.continuous_variables.both) %>%
  # prettify a few things; names containing spaces or symbols need backticks
  rename(df = parameter,
         p = p.value,
         `Control N/Mean (SD)` = Control,
         Variable = variable,
         Response = value,
         `t/χ2` = statistic))
Warning: funs() is soft deprecated as of dplyr 0.8.0

# Before:
funs(name = f(.))

# After:
list(name = ~ f(.))
This warning is displayed once per session.
Warning in bind_rows_(x, .id): binding character and factor vector,
coercing into character vector
# A tibble: 11 x 7
Variable  Response     Control N/Mean… Intervention t/χ2    df      p
<chr>     <chr>        <chr>            <chr>         <dbl> <dbl>  <dbl>
1 education Graduate     24               34            2.92     3   0.404
2 <NA>      Postgraduate 26               28           NA       NA  NA
3 <NA>      Primary      28               31           NA       NA  NA
4 <NA>      Secondary    31               23           NA       NA  NA
5 <NA>      <NA>         31               24           NA       NA  NA
6 ethnicity Asian / Asi… 32               28            0.413    3   0.937
7 <NA>      Black / Afr… 36               36           NA       NA  NA
8 <NA>      Mixed / mul… 40               41           NA       NA  NA
9 <NA>      White Briti… 32               35           NA       NA  NA
10 yob       <NA>         1979.31 (5.58)   1979.97 (4.… -1.07   269.  0.285
11 WM        <NA>         99.37 (10.00)    99.94 (10.9… -0.455  276.  0.649

And we can print to markdown format for outputting. This is best done in a separate chunk to avoid warnings/messages appearing in the final document.
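
If the document is being knitted with knitr/R Markdown (an assumption; the source does not say which chunk options were used), another option is to switch warnings and messages off globally in a setup chunk:

```r
# Assumes the knitr package is installed and the document is rendered with it.
# Hides warnings and messages from all subsequent chunks in the output.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```

This hides the output of every later chunk, so it is worth checking the warnings interactively before turning them off.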

table1 %>%
# split.tables argument needed to avoid the table wrapping
pander(split.tables=Inf,
missing="-",
justify=c("left", "left", rep("center", 5)),
caption='Table presenting baseline differences between conditions. Categorical variables tested with Pearson χ2, continuous variables with two-sample t-test.')
Table presenting baseline differences between conditions. Categorical variables tested with Pearson χ2, continuous variables with two-sample t-test.

| Variable | Response | Control N/Mean (SD) | Intervention | t/χ2 | df | p |
|:---|:---|:---:|:---:|:---:|:---:|:---:|
| education | Graduate | 24 | 34 | 2.921 | 3 | 0.404 |
| - | Postgraduate | 26 | 28 | - | - | - |
| - | Primary | 28 | 31 | - | - | - |
| - | Secondary | 31 | 23 | - | - | - |
| - | - | 31 | 24 | - | - | - |
| ethnicity | Asian / Asian British | 32 | 28 | 0.4133 | 3 | 0.9375 |
| - | Black / African / Caribbean / Black British | 36 | 36 | - | - | - |
| - | Mixed / multiple ethnic groups | 40 | 41 | - | - | - |
| - | White British | 32 | 35 | - | - | - |
| yob | - | 1979.31 (5.58) | 1979.97 (4.63) | -1.071 | 268.9 | 0.2849 |
| WM | - | 99.37 (10.00) | 99.94 (10.99) | -0.455 | 275.5 | 0.6495 |

Some exercises to work on/extensions to this code you might need:

• Add a new continuous variable to the simulated dataset and include it in the final table
• Create a third experimental group and amend the code to (i) include 3 columns for the N/Mean (SD) and (ii) report the F-test from a one-way ANOVA as the test statistic.
• Add the within-group percentage for each response to a categorical variable.