2  Descriptive Analysis


We start our evaluation with a descriptive analysis of our trial data. The data we will be working with for the remainder of this tutorial can be downloaded here, or from the tutorial’s GitHub repository. We refer readers who have skipped the first part of the tutorial to Chapter 1, where we covered how to correctly set up one’s R environment and import the data used in the following analyses.

First, let us familiarize ourselves with the data. Our dataset is based on a “real-world” RCT which compared the effectiveness of an Internet-based treatment for subthreshold depression to a waitlist control condition (Ebert et al. 2018). Notably, the data set provided here does not consist of the original individual patient data collected in the trial; we have simulated a “clone” of this trial that shares most of its statistical properties but does not contain any actual patient data. The simulation code is provided in the GitHub repository (see here).

Our primary outcome in this trial evaluation will be patients’ depressive symptom severity at post-test (7 weeks after randomization), as measured by the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff 1977). The scores obtained by this instrument can take values from 0 to 60, with higher values indicating greater depressive symptom severity. The CES-D was also used to measure depressive symptom severity at baseline and at a 3-month follow-up. This means that our data set contains three variables: cesd.0, cesd.1 and cesd.2, which store the CES-D values at baseline, post-test and 3-month follow-up, respectively. The other recorded variables are shown in the overview below.

An excellent method to get an overview of all variables included in a data frame is to use the skim function included in the skimr package (Waring et al. 2022). Once this package has been installed on our computer, we can run the following code:

library(skimr)
skim(data)
Show result
## -- Data Summary ------------------------
## Values
## Name                       data  
## Number of rows             546   
## Number of columns          21    
## [...]  
## 
## -- Variable type: character ----------------------------------
##   skim_variable n_missing complete_rate min max empty n_unique
## 1 id                    0             1   4   6     0      546
## 2 sex                   0             1   1   1     0        2
## 3 prevpsychoth          0             1   1   1     0        2
## 4 prevtraining          0             1   1   1     0        2
## 5 child                 0             1   1   1     0        2
## 6 employ                0             1   1   1     0        2
## 7 degree                0             1   1   1     0        4
## 
## -- Variable type: numeric ------------------------------------
##    skim_variable n_missing complete_rate  mean     sd
## 1  badssf.2            165         0.698 27.0   8.00
## 2  badssf.1            155         0.716 27.9   6.96
## 3  hadsa.2             165         0.698  7.78  3.93
## 4  hadsa.1             155         0.716  7.95  3.47
## 5  cesd.2              165         0.698 20.1  10.3 
## 6  cesd.1              155         0.716 19.7   9.15
## 7  age                   0         1     45.3  11.5
## 8  rel                   0         1      0.855  0.695
## 9  cesd.0                0         1     26.6   7.13
## 10 atsphs.0              0         1     13.9   2.57
## 11 ceq                   0         1     34.7   8.44
## 12 badssf.0              0         1     26.3   6.02
## 13 hadsa.0               0         1     10.0   3.04
## 14 group                 0         1      0.5   0.500 

There is a lot to unpack here, so let us go through the output (which we have simplified a little here) step by step. First, we are presented with a summary of our data set: we have 546 rows, corresponding to the \(N\)=546 patients included in our trial. We also see that we have 21 variables in our data, details of which are provided below. The skim function splits the displayed summary statistics by class; the first section shows columns encoded as character variables, and the second section shows the numeric columns.

This is also a good way to check if imported variables indeed have the desired class, which is not the case in our example: the only “real” character variable in our data set is id; all other columns are actually numeric (although one could also encode them as factors). Thus, let us convert those variables first, which can be done using the as.numeric function:

data$sex <- as.numeric(data$sex)
data$prevpsychoth <- as.numeric(data$prevpsychoth)
data$prevtraining <- as.numeric(data$prevtraining)
data$child <- as.numeric(data$child)
data$employ <- as.numeric(data$employ)
data$degree <- as.numeric(data$degree)
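When many columns must be converted, writing one line per variable becomes tedious. A more compact pattern is to convert all affected columns with a single lapply call; this is only a sketch, demonstrated on a small mock data frame (on our data set, cols would simply list the six variable names above):

```r
# Mock data frame with character columns (for illustration only)
df <- data.frame(sex   = c("1", "0", "1"),
                 child = c("0", "0", "1"),
                 stringsAsFactors = FALSE)

# Convert all listed columns to numeric in one step
cols <- c("sex", "child")
df[cols] <- lapply(df[cols], as.numeric)

sapply(df, class)   # both columns are now "numeric"
```
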

To further eyeball the data, it is also helpful to run the skim function separately for the intervention and control group. We can achieve this by using the filter function within a pipe:

data %>% filter(group == 0) %>% skim() # Control group
data %>% filter(group == 1) %>% skim() # Intervention group

Splitting up the results by group is particularly helpful to see differences in the primary outcome. For now, let us focus on the values of cesd.1 and cesd.2, which measure the depressive symptom severity at post-test and 3-month follow-up, respectively:

Show result
##-- Variable type: numeric ---------------------------
##   skim_variable n_missing complete_rate  mean    sd 
## 5 cesd.2               81         0.703 22.5  10.4  
## 6 cesd.1               76         0.722 23.1   8.51 
## [...]
## 5 cesd.2               84         0.692 17.7   9.66
## 6 cesd.1               79         0.711 16.4   8.54

The upper rows show the output for the control group, and the lower ones for the intervention group. First, we see that, depending on the group and outcome, data are available for 69.2-72.2% of all patients. That means that roughly one-third of all data at post-test and follow-up is missing, which is considerable. We also see that the complete_rate is somewhat higher at post-test and comparable in both groups. More importantly, at both assessment points, the mean CES-D value in the intervention group is lower than in the control group. This is a preliminary sign that the intervention may have had an effect in the recruited sample, but of course, a more rigorous test is required to confirm this.
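Instead of scanning the full skim output twice, the group-wise means of the primary outcome can also be computed directly, for example with tapply. A minimal sketch with mock values (on the real data set, data$cesd.1 and data$group would be used instead):

```r
# Mock values for illustration only (not the actual trial data)
df <- data.frame(group  = c(0, 0, 1, 1),
                 cesd.1 = c(24, NA, 17, 15))

# Mean post-test CES-D score per group, ignoring missing values
tapply(df$cesd.1, df$group, mean, na.rm = TRUE)
```
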

It is also helpful to have a look at the distribution of our primary outcome at baseline, post-test and 3-month follow-up. To create a histogram of all three variables, we can use the multi.hist function included in the psych package (Revelle 2022):

library(psych)
multi.hist(data %>% select(cesd.0, cesd.1, cesd.2), ncol = 3)


There are a few interesting things to see here. First, compared to baseline, the CES-D scores at post-test and follow-up are shifted slightly to the left. This could be caused by an inherent effect of the intervention in the treatment group; but other causes such as regression to the mean or spontaneous remission could also play a role. We also see that, while all three histograms roughly follow a bell curve, the post-test and follow-up values seem to be more dispersed, with more values on the lower end of the scale, but also some very high scores on the other end.

The last step in our descriptive analysis is to quantify the amount of missing data at all assessment points. This part is crucial since loss-to-follow-up is also a mandatory part of the Consolidated Standards of Reporting Trials (CONSORT, Moher et al. 2012) flow chart, which we will be creating later.

To create an overview of the missing outcome data, we focus on our primary outcome, the cesd variables. To count the missings, we can use the very helpful is.na function. This function checks if elements in a vector are missing (NA). It returns TRUE if that is the case, and FALSE otherwise. We can then apply the sum function to this result to count the number of missings in a vector. Here is a small example:

x <- c(1, 2, 3, 4, 5, NA, NA, 200)      # Create vector
sum(is.na(x))                           # Count the number of missings
Show result
## [1] 2

We will now put this function into practice and extract the number of missings in the cesd variables at all three assessment points. To simplify this, we use the with function, which tells R that all code written within the function should be run with our data set data. This means that we do not have to extract each variable from data using the $ operator. We save the result as na.all. Then, we divide the results by the number of rows in data to get the amount of missing data in percent, and save this result as na.all.p.

with(data, {                            # Get the missings in all trial data
  c(sum(is.na(cesd.0)),
    sum(is.na(cesd.1)),
    sum(is.na(cesd.2)))
}) -> na.all

na.all.p <- na.all/nrow(data)           # Calculate percentages
na.all.p
Show result
## [1] 0.0000000 0.2838828 0.3021978
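
As an aside, the same counts can be obtained in a single call: is.na also works on entire data frames, so colSums and colMeans directly yield the number and proportion of missings per column. A minimal sketch on a mock data frame (the same calls apply to the cesd columns of data):

```r
# Mock data frame with missing values (for illustration only)
df <- data.frame(cesd.0 = c(21, 30, 25),
                 cesd.1 = c(18, NA, 20),
                 cesd.2 = c(NA, NA, 17))

colSums(is.na(df))    # number of missings per column
colMeans(is.na(df))   # proportion of missings per column
```
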

We are not only interested in the overall dropout but also in the individual missingness in each randomized group. Therefore, let us repeat the procedure from before for the intervention and control group specifically.

data %>%                                # Do the same for intervention group only…
  filter(group == 1) %>%
  with({
    c(sum(is.na(cesd.0)),
      sum(is.na(cesd.1)),
      sum(is.na(cesd.2)))
  }) -> na.ig

na.ig.p <- na.ig/nrow(data %>% filter(group == 1))

data %>%                                # … and control group only.
  filter(group == 0) %>%
  with({
    c(sum(is.na(cesd.0)),
      sum(is.na(cesd.1)),
      sum(is.na(cesd.2)))
  }) -> na.cg

na.cg.p <- na.cg/nrow(data %>% filter(group == 0))

Next, we compile all calculated values in one data frame:

na <- data.frame(na.all, na.all.p = na.all.p*100,   # Collect all data in data frame
                 na.ig, na.ig.p = na.ig.p*100,
                 na.cg, na.cg.p = na.cg.p*100)

rownames(na) <- c("t0", "t1", "t2")                 # Give the rows fitting names
round(na, 2)
Show result
##    na.all na.all.p na.ig na.ig.p na.cg na.cg.p
## t0      0     0.00     0    0.00     0    0.00
## t1    155    28.39    79   28.94    76   27.84
## t2    165    30.22    84   30.77    81   29.67
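
These counts translate directly into the numbers needed for the flow chart. Assuming the 1:1 allocation implied by the group variable (273 participants randomized to each arm), the number of participants still providing data at each assessment point is simply the arm size minus the missings from the table above:

```r
# Participants randomized per arm (546 in total, allocated 1:1)
n.arm <- 546 / 2

# Participants providing data at post-test (t1) and follow-up (t2)
n.arm - c(t1 = 79, t2 = 84)   # intervention group
n.arm - c(t1 = 76, t2 = 81)   # control group
```
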

We can now use this information to populate the study flow chart. Such flow charts are mandated by the CONSORT reporting guidelines that most biomedical journals have endorsed, and which should be seen as a standard part of every RCT analysis report. It is possible to generate CONSORT flow charts directly in R using the consort package (Dayim 2023). However, especially for beginners, it may be more convenient to simply download the official PDF or MS Word template from the CONSORT website, and then populate it with the numbers that we just calculated.

In the graphic on the next page, we have used the missingness data we calculated above to indicate the number of participants providing data, as well as the number of participants being lost to follow-up at each assessment point. Please note that the CONSORT flow chart also contains information on the screening process (how many individuals were screened, how many were excluded from the trial, and for what reason?). This information is not included in the final RCT data set and is something that must be recorded while the trial is running.


Exemplary CONSORT flow chart for our trial.
The CONSORT Statement

The CONSORT Statement was first introduced in 1996 to improve the reporting of randomized clinical trials in health research. Its goal is to provide a set of standards that all RCT reports should adhere to. The CONSORT guidelines are supported by almost all relevant journals in the mental health field, so adhering to them is highly recommended.

The latest 2010 version of the CONSORT statement consists of a main paper (Schulz, Altman, and Moher 2010) and an elaboration and explanation article (Moher et al. 2012), both of which are extremely helpful resources for novice RCT analysts. The core parts of the CONSORT Statement are the trial flow chart, which we now learned how to create, and the 25-item CONSORT Checklist, which should be filled out and provided with every RCT report, and which can be downloaded here.