
Building Cohorts with STROBE Diagrams
Source: vignettes/building-cohort-with-strobe.Rmd
Introduction
The STROBE (Strengthening the Reporting of Observational Studies in
Epidemiology) statement provides guidelines for reporting observational
studies. A key component is the flow diagram that shows how the study
population was selected. The strobe package provides tools to create
STROBE-compliant flow diagrams while building your cohort.
This vignette demonstrates how to use the functions in the strobe package to:
- Initialize a cohort with inclusion criteria
- Apply sequential filters with exclusion tracking
- Review the complete filtering log
Basic STROBE Workflow
The STROBE workflow follows these steps:
- Initialize the cohort with strobe_initialize()
- Filter sequentially with strobe_filter()
- Review the log with get_strobe_log()
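To make the idea behind these three steps concrete, here is a minimal base R sketch of sequential filtering with exclusion tracking. It is not the strobe package's implementation: the helper `track_filter()`, the `state` list, and the toy data are all hypothetical and exist only for illustration.

```r
# Toy data; the values are made up purely for illustration.
toy <- data.frame(
  age = c(25, 34, 47, 52, 61, 29),
  cmv = c(1, 1, 0, 1, 1, 0)
)

# Hypothetical helper (not part of strobe): apply one filter given as a
# string and append a row describing the step to a running log.
track_filter <- function(state, condition, label) {
  keep <- eval(parse(text = condition), envir = state$data)
  keep <- !is.na(keep) & keep  # rows with NA do not meet the criterion
  kept <- state$data[keep, , drop = FALSE]
  step <- data.frame(
    label     = label,
    filter    = condition,
    remaining = nrow(kept),
    dropped   = nrow(state$data) - nrow(kept),
    stringsAsFactors = FALSE
  )
  list(data = kept, log = rbind(state$log, step))
}

# "Initialize": record the starting population before any filter.
state <- list(
  data = toy,
  log  = data.frame(
    label = "Start", filter = NA_character_,
    remaining = nrow(toy), dropped = NA_integer_,
    stringsAsFactors = FALSE
  )
)

# "Filter sequentially": each step sees only the rows kept so far.
state <- track_filter(state, "age >= 30", "Age >= 30")
state <- track_filter(state, "cmv == 1", "CMV positive")

# "Review the log": one row per step.
print(state$log)
```

Each step records the rows that remain and the rows it dropped, which is the information a flow diagram needs for its boxes.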
Example: Cytomegalovirus
This example requires the medicaldata package from CRAN to demonstrate
results.
# Load packages (dplyr supplies the %>% pipe and select())
library(strobe)
library(dplyr)
library(medicaldata)

# Obtain data
data(cytomegalovirus)

# Create cohort
df <- cytomegalovirus %>%
  strobe_initialize(inclusion_label = "Initial transplant cohort") %>%
  strobe_filter(
    condition = "age >= 30",
    inclusion_label = "Age ≥ 30",
    exclusion_reason = "Excluded: Age < 30"
  ) %>%
  strobe_filter(
    condition = "recipient.cmv == 1",
    inclusion_label = "CMV positive recipients",
    exclusion_reason = "Excluded: CMV negative"
  ) %>%
  strobe_filter(
    condition = "prior.transplant == 0",
    inclusion_label = "No prior transplant",
    exclusion_reason = "Excluded: Prior transplant"
  ) %>%
  dplyr::select(age, recipient.cmv, donor.cmv, prior.transplant, sex, race)

# Show the first rows and selected columns of the final data frame
head(df)
#>   age recipient.cmv donor.cmv prior.transplant sex race
#> 1  61             1         0                0   1    0
#> 3  63             1         1                0   0    1
#> 4  33             1         0                0   0    1
#> 5  54             1         1                0   0    1
#> 6  55             1         1                0   1    1
#> 7  67             1         1                0   1    1
Reviewing the Filtering Log
get_strobe_log()
#> # A tibble: 4 × 7
#>   id    parent inclusion_label         exclusion_reason filter remaining dropped
#>   <chr> <chr>  <chr>                   <chr>            <chr>      <int>   <int>
#> 1 start NA     Initial transplant coh… NA               NA            64      NA
#> 2 step1 start  Age ≥ 30                Excluded: Age <… age >…        63       1
#> 3 step2 step1  CMV positive recipients Excluded: CMV n… recip…        40      23
#> 4 step3 step2  No prior transplant     Excluded: Prior… prior…        34       6
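The log shows:
- Each filtering step and the step it follows (id, parent)
- The inclusion label and exclusion reason for each step
- The filter condition applied (filter)
- The number of records remaining after each step and the number dropped by it (remaining, dropped)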
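A quick arithmetic check can confirm that a log like this is internally consistent: each step's remaining count should equal the previous step's remaining count minus the records it dropped. The sketch below copies the counts from the four-row log printed above into a plain data frame so it runs on its own. In practice one would presumably start from the result of get_strobe_log() instead. The object names `log_tbl`, `expected`, and `total_excluded` are illustrative.

```r
# Counts copied by hand from the four-row log printed above.
log_tbl <- data.frame(
  id        = c("start", "step1", "step2", "step3"),
  remaining = c(64L, 63L, 40L, 34L),
  dropped   = c(NA, 1L, 23L, 6L),
  stringsAsFactors = FALSE
)

# For every step after the start: previous remaining minus dropped.
expected <- head(log_tbl$remaining, -1) - log_tbl$dropped[-1]
consistent <- all(expected == log_tbl$remaining[-1])

# Total exclusions should account for the whole drop from start to end.
total_excluded <- sum(log_tbl$dropped, na.rm = TRUE)

consistent
total_excluded
log_tbl$remaining[1] - total_excluded
```

Here the counts line up: 64 − 1 = 63, 63 − 23 = 40, and 40 − 6 = 34, so the 30 exclusions fully explain the drop from 64 to 34 records.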
Advanced Examples
Complex Multi-Step Filtering
cytomegalovirus %>%
  strobe_initialize(inclusion_label = "All transplant recipients") %>%
  strobe_filter(
    condition = "age >= 18 & age <= 65",
    inclusion_label = "Age 18–65",
    exclusion_reason = "Excluded: Outside 18–65"
  ) %>%
  strobe_filter(
    condition = "recipient.cmv == 1",
    inclusion_label = "CMV positive",
    exclusion_reason = "Excluded: CMV negative"
  ) %>%
  strobe_filter(
    condition = "donor.cmv == 1",
    inclusion_label = "Donor CMV positive",
    exclusion_reason = "Excluded: Donor CMV negative"
  ) %>%
  strobe_filter(
    condition = "prior.transplant == 0",
    inclusion_label = "No prior transplant",
    exclusion_reason = "Excluded: Prior transplant"
  ) %>%
  # Show the first rows and selected columns of the final data frame
  dplyr::select(age, recipient.cmv, donor.cmv, prior.transplant, sex, race) %>%
  head()
#>    age recipient.cmv donor.cmv prior.transplant sex race
#> 3   63             1         1                0   0    1
#> 5   54             1         1                0   0    1
#> 6   55             1         1                0   1    1
#> 16  61             1         1                0   0    1
#> 17  62             1         1                0   0    1
#> 19  62             1         1                0   1    0
The filter log from the above pipeline is below:
get_strobe_log()
#> # A tibble: 5 × 7
#>   id    parent inclusion_label         exclusion_reason filter remaining dropped
#>   <chr> <chr>  <chr>                   <chr>            <chr>      <int>   <int>
#> 1 start NA     All transplant recipie… NA               NA            64      NA
#> 2 step1 start  Age 18–65               Excluded: Outsi… age >…        63       1
#> 3 step2 step1  CMV positive            Excluded: CMV n… recip…        39      24
#> 4 step3 step2  Donor CMV positive      Excluded: Donor… donor…        28      11
#> 5 step4 step3  No prior transplant     Excluded: Prior… prior…        23       5
Terminal Branching
The final step of a STROBE diagram often shows not a further inclusion
or exclusion but how the final cohort is stratified by the levels of a
factor variable. The create_terminal_branch() function is designed for
this purpose and can handle factor variables with up to six levels
(including missing values).
The optional label_prefix argument prepends text (e.g., the variable
name) to each terminal group label. For example, setting
label_prefix = "cgvhd:" produces labels like "cgvhd: 0" and "cgvhd: 1".
Below, we use the variable cgvhd (chronic graft-versus-host disease) to
stratify the final inclusion cohort into two groups:
df <- create_terminal_branch(df, variable = "cgvhd", label_prefix = "CGVHD value:")
get_strobe_log()
#> # A tibble: 5 × 7
#>   id    parent inclusion_label         exclusion_reason filter remaining dropped
#>   <chr> <chr>  <chr>                   <chr>            <chr>      <int>   <int>
#> 1 start NA     All transplant recipie… NA               NA            64      NA
#> 2 step1 start  Age 18–65               Excluded: Outsi… age >…        63       1
#> 3 step2 step1  CMV positive            Excluded: CMV n… recip…        39      24
#> 4 step3 step2  Donor CMV positive      Excluded: Donor… donor…        28      11
#> 5 step4 step3  No prior transplant     Excluded: Prior… prior…        23       5
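As a rough picture of what stratifying a final cohort by a factor variable involves, the base R sketch below counts the records at each level of a variable, keeps missing values as their own group, and builds prefixed labels. It does not reproduce the internals or output format of create_terminal_branch(). The toy `cgvhd` vector, the "Missing" label, and the `branches` data frame are assumptions made for illustration.

```r
# Toy stratification variable; the values are made up for illustration.
cgvhd <- c(0, 1, 1, NA, 0, 1)
label_prefix <- "CGVHD value:"

# Count records per level, keeping NA as a separate group.
counts <- table(cgvhd, useNA = "ifany")

# Build one label per group; "Missing" is an arbitrary choice for NA.
lv <- names(counts)
lv <- ifelse(is.na(lv), "Missing", lv)

branches <- data.frame(
  label = paste(label_prefix, lv),
  n     = as.vector(counts),
  stringsAsFactors = FALSE
)

# The text above mentions a limit of six levels, including missing values.
stopifnot(nrow(branches) <= 6)

print(branches)
```

The group counts sum to the size of the cohort being stratified, so a terminal branch partitions the final cohort without excluding anyone.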
See Also
?strobe_initialize
?strobe_filter
?get_strobe_log
?create_factor_variable