Montefiore Medical Center/ Albert Einstein College of Medicine
Overview
The code in this script uses the cohort of kidney transplant patients with and without cryptococcosis developed in the R/create_cohort.R and executes n:1 matching based on a custom-written greedy algorithm
Characteristics of the matching algorithm:
Matching without replacement: Each patient without cryptococcosis only serves as a control to one case at most. This makes subsequent statistical analysis simpler.
Risk set sampling (incidence density sampling): If a case patient develops cryptococcosis at 243 days after their most recent kidney transplant, then all potential controls are chosen based on their covariate values at exactly 243 days post-transplant.
Greedy matching: Cases are ranked by the number of potential controls at the onset, and then matches chosen without replacement without subsequent reordering.
Mix of exact matching** and caliper matching
Exact matching on the following:
Cirrhosis
HIV
Diabetes
Caliper matching on the following:
Age (within 10 years on the index date, calculated exactly)
Index date of case and controls (within 3 years of each other)
The following code prepares the cohort for matching by creating a list object unmatched_patients, keeping only the initial row for each case as only the covariates at the time of cryptococcus are necessary for matching.
Click to show/hide R Code
unmatched_patients<-list()unmatched_patients[["Case"]]<-prematching_cohort%>%filter(cryptococcus_case=="Case")unmatched_patients[["Control"]]<-prematching_cohort%>%filter(cryptococcus_case=="Potential control")unmatched_patients[["Case"]]<-unmatched_patients[["Case"]]%>%group_by(USRDS_ID)%>%slice(1)%>%ungroup()#For exploration--perhaps can delete in future?Case<-unmatched_patients[["Case"]]Control<-unmatched_patients[["Control"]]
Counting the potental number of matched controls for each case
This step uses the calculate_number_potential_matches() function defined in the R/functions.R to add a column to the cases data set, then sorts the cases so that those with the lowest number of potential matches are matched first.
(Note the following code took ~2 hours to execute on a laptop given the size of the dataset.)
Click to show/hide R Code
#Add number of potential control matches to Case dfunmatched_patients[["Case"]] <- unmatched_patients[["Case"]] %>%mutate(num_potential_controls =pmap_int(list(USRDS_ID, BORN,cirrhosis, CMV, HIV, diabetes, cryptococcus_dx_date, tstart),function(USRDS_ID, BORN, cirrhosis, CMV, HIV, diabetes, matching_date, tstart) {calculate_number_potential_matches( unmatched_patients[["Control"]], USRDS_ID, BORN, cirrhosis, CMV, HIV, diabetes, matching_date, tstart ) } ) )#Sort by number of potential matches so that most difficult to match cases are matched first#Set the index_date for cases to be the date of cryptococcus dxunmatched_patients[["Case"]] <- unmatched_patients[["Case"]] %>%arrange(num_potential_controls)%>%mutate(index_date_match=cryptococcus_dx_date)#Can be deletedunmatched_patients_backup<-unmatched_patientsunmatched_patients<-unmatched_patients_backup
Create a matched data set
The following code executes the match, creating a matched dataframe suitable for use in subsequent analysis.
Click to show/hide R Code
#Time to set up the matching loop#Initial the loop controlcontinue<-TRUEi<-0set.seed(12345)#Initialize the object where the matched cases and controls will be storedmatched_patients<-list()matched_patients[["Case"]]<-unmatched_patients[["Case"]][0, ]%>%mutate(matched_control_indices =list())matched_patients[["Control"]] <- unmatched_patients[["Control"]][0, ] %>%mutate(index_date_match =as.Date(character()))#Perform the matchwhile (continue==TRUE){ i<-i+1print(i)#Select matches from controls eligible_controls <-unmatched_patients[["Control"]]%>%filter(cirrhosis==unmatched_patients$Case$cirrhosis[[1]])%>%#filter(CMV==unmatched_patients$Case$CMV[[1]])%>%filter(HIV==unmatched_patients$Case$HIV[[1]])%>%filter(diabetes==unmatched_patients$Case$diabetes[[1]])%>%#Risk set matchingfilter(tstart<=unmatched_patients$Case$tstart[[1]])%>%filter(tstop>unmatched_patients$Case$tstart[[1]])%>%#Calculate date for age calculations/etc.mutate(.baseline_control_date=unmatched_patients$Case$tstart[[1]] + most_recent_transplant_date)%>%#Make sure case and control are sampled within 3 years of each otherfilter(abs(time_length(interval(.baseline_control_date, unmatched_patients$Case$cryptococcus_dx_date[[1]]), "years")) <=3)%>%#filter(cohort_start_date<=unmatched_patients$Case$cryptococcus_dx_date[[1]])%>%#filter(cohort_stop_date>unmatched_patients$Case$cryptococcus_dx_date[[1]])%>%#Age>=18 on index datefilter(time_length(interval(BORN, .baseline_control_date), "years") >=18)%>%#Age difference under 10 years, calculated at sampling datefilter(abs(time_length(interval(BORN,.baseline_control_date), "years")-time_length(interval(unmatched_patients$Case$BORN[[1]], unmatched_patients$Case$cryptococcus_dx_date[[1]]), "years")) <=10)%>%verify_medicare_primary(index_date =".baseline_control_date", medicare_coverage_df = medicare_history, coverage_start_variable ="medicare_coverage_start_date",coverage_end_variable ="medicare_coverage_end_date",cache=TRUE)%>%filter(medicare_primary_TF==TRUE)%>%#Move matches for cumulative number of transplants to the top#Note: this step could be used for propensity score matching on various variables#Consider minimizing date difference and age difference as wellarrange(desc(cumulative_transplant_total==unmatched_patients$Case$cumulative_transplant_total[[1]]))#Number of controls is minimum of global constant number_controls_per_case and number of distinct USRDS_IDs k <-min(number_controls_per_case, nrow(distinct(eligible_controls, USRDS_ID)))#Initialize sampled_controls df sampled_controls<-eligible_controls[0, ]#Loop to select controls one at a time to ensure same patient is not sampled multiple times while (k>0) { sampled_controls <-bind_rows( sampled_controls, eligible_controls %>%slice_head(n =1)) eligible_controls<-eligible_controls%>%filter(!(USRDS_ID %in% sampled_controls$USRDS_ID )) k<-min(k-1, nrow(distinct(eligible_controls, USRDS_ID))) } sampled_controls<-sampled_controls%>%mutate(index_date_match = .baseline_control_date)%>%select(-.baseline_control_date)#Add case to matched DF matched_patients[["Case"]]<-unmatched_patients$Case[1,]%>%mutate(matched_control_indices=list(sampled_controls$USRDS_ID))%>%bind_rows(matched_patients[["Case"]])#Add controls to control DF matched_patients[["Control"]]<-matched_patients[["Control"]]%>%bind_rows(sampled_controls)#Remove controls unmatched_patients[["Control"]]<-unmatched_patients[["Control"]]%>%filter(!(USRDS_ID %in% sampled_controls$USRDS_ID ))#Remove cases unmatched_patients[["Case"]]<-unmatched_patients[["Case"]]%>%slice(-1)if (nrow(unmatched_patients[["Case"]])==0){ continue<-FALSE } if (i==10000){ continue<-FALSE }}post_match_results<-matched_patients%>%bind_rows(.id ="patient_type")
The modeling script uses the above matched dataframe for the actual cost analysis.
Other portions of the analysis
Setup: Defines global paths, data sources, cohort inclusion criteria, and analysis-wide constants.
Functions: Reusable helper functions for cohort construction, matching, costing, and modeling.
Create cohort: Constructs the initial time-varying cohort of kidney transplant recipients, defining cohort entry, follow-up structure, and case/control eligibility prior to matching.
Execute matching: Implements risk-set–based greedy matching without replacement to construct the analytic cohort.
Post-match processing: Derives analytic variables, time-aligned cost windows, and follow-up structure after matching.
Modeling: Fits prespecified cost and outcome models using the matched cohort.
Tables: Summary tables and regression outputs generated from the final models.
Figures:Visualizations of costs, risks, and model-based estimates.
---title: "Execute matching"format: html---## OverviewThe code in this script uses the cohort of kidney transplant patients with and without cryptococcosis developed in the [`R/create_cohort.R`](create_cohort.qmd)and executes n:1 matching based on a custom-written greedy algorithmCharacteristics of the matching algorithm:- **[Matching without replacement:](https://pmc.ncbi.nlm.nih.gov/articles/PMC2943670/)** Each patient without cryptococcosis only serves as a control to one case at most. This makes subsequent statistical analysis simpler.- **[Risk set sampling (incidence density sampling):](https://epiville.ccnmtl.columbia.edu/popup/sampling_techniques.html)** If a case patient develops cryptococcosis at 243 days after their most recent kidney transplant, then all potential controls arechosen based on their covariate values at exactly 243 days post-transplant.- **[Greedy matching:](https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Data_Matching-Optimal_and_Greedy.pdf)** Cases are ranked by the number of potential controls at the onset, and then matches chosen without replacement without subsequent reordering.- **Mix of **exact matching** and **caliper matching** - Exact matching on the following: - Cirrhosis - HIV - Diabetes - Caliper matching on the following: - Age (within 10 years on the index date, calculated exactly) - Index date of case and controls (within 3 years of each other) - Preferential matching on the following: - Number of transplants prior to index date ::: Rcode### Source codeThe full R script is available at:- [`R/execute_matching.R`](https://github.com/VagishHemmige/Cryptococcus-cost_analysis/blob/master/R/execute_matching.R)This R script file is itself reliant on the following helper files:- [`R/setup.R`](https://github.com/VagishHemmige/Cryptococcus-cost_analysis/blob/master/R/setup.R)- [`R/functions.R`](https://github.com/VagishHemmige/Cryptococcus-cost_analysis/blob/master/R/functions.R):::## Prepare cohort for matchingThe following code prepares the cohort for matching by creating a list object `unmatched_patients`, keeping only the initial row for each case as onlythe covariates at the time of cryptococcus are necessary for matching.```{r, eval=FALSE}unmatched_patients<-list()unmatched_patients[["Case"]]<-prematching_cohort%>% filter(cryptococcus_case=="Case")unmatched_patients[["Control"]]<-prematching_cohort%>% filter(cryptococcus_case=="Potential control")unmatched_patients[["Case"]]<-unmatched_patients[["Case"]]%>% group_by(USRDS_ID)%>% slice(1)%>% ungroup()#For exploration--perhaps can delete in future?Case<-unmatched_patients[["Case"]]Control<-unmatched_patients[["Control"]]```## Counting the potental number of matched controls for each caseThis step uses the `calculate_number_potential_matches()` function defined in the [`R/functions.R`](functions.qmd) to add a column to the cases data set, thensorts the cases so that those with the lowest number of potential matches are matched first.(Note the following code took ~2 hours to execute on a laptop given the size of the dataset.)```{r, eval=FALSE}#Add number of potential control matches to Case dfunmatched_patients[["Case"]] <- unmatched_patients[["Case"]] %>% mutate( num_potential_controls = pmap_int( list(USRDS_ID, BORN,cirrhosis, CMV, HIV, diabetes, cryptococcus_dx_date, tstart), function(USRDS_ID, BORN, cirrhosis, CMV, HIV, diabetes, matching_date, tstart) { calculate_number_potential_matches( unmatched_patients[["Control"]], USRDS_ID, BORN, cirrhosis, CMV, HIV, diabetes, matching_date, tstart ) } ) )#Sort by number of potential matches so that most difficult to match cases are matched first#Set the index_date for cases to be the date of cryptococcus dxunmatched_patients[["Case"]] <- unmatched_patients[["Case"]] %>% arrange(num_potential_controls)%>% mutate(index_date_match=cryptococcus_dx_date)#Can be deletedunmatched_patients_backup<-unmatched_patientsunmatched_patients<-unmatched_patients_backup```## Create a matched data setThe following code executes the match, creating a matched dataframe suitable for use in subsequent analysis.```{r eval=FALSE}#Time to set up the matching loop#Initial the loop controlcontinue<-TRUEi<-0set.seed(12345)#Initialize the object where the matched cases and controls will be storedmatched_patients<-list()matched_patients[["Case"]]<-unmatched_patients[["Case"]][0, ]%>% mutate(matched_control_indices = list())matched_patients[["Control"]] <- unmatched_patients[["Control"]][0, ] %>% mutate(index_date_match = as.Date(character()))#Perform the matchwhile (continue==TRUE){ i<-i+1 print(i) #Select matches from controls eligible_controls <-unmatched_patients[["Control"]]%>% filter(cirrhosis==unmatched_patients$Case$cirrhosis[[1]])%>% #filter(CMV==unmatched_patients$Case$CMV[[1]])%>% filter(HIV==unmatched_patients$Case$HIV[[1]])%>% filter(diabetes==unmatched_patients$Case$diabetes[[1]])%>% #Risk set matching filter(tstart<=unmatched_patients$Case$tstart[[1]])%>% filter(tstop>unmatched_patients$Case$tstart[[1]])%>% #Calculate date for age calculations/etc. mutate(.baseline_control_date=unmatched_patients$Case$tstart[[1]] + most_recent_transplant_date)%>% #Make sure case and control are sampled within 3 years of each other filter(abs(time_length(interval(.baseline_control_date, unmatched_patients$Case$cryptococcus_dx_date[[1]]), "years")) <=3)%>% #filter(cohort_start_date<=unmatched_patients$Case$cryptococcus_dx_date[[1]])%>% #filter(cohort_stop_date>unmatched_patients$Case$cryptococcus_dx_date[[1]])%>% #Age>=18 on index date filter(time_length(interval(BORN, .baseline_control_date), "years") >= 18)%>% #Age difference under 10 years, calculated at sampling date filter(abs(time_length(interval(BORN,.baseline_control_date), "years")- time_length(interval(unmatched_patients$Case$BORN[[1]], unmatched_patients$Case$cryptococcus_dx_date[[1]]), "years")) <=10)%>% verify_medicare_primary(index_date = ".baseline_control_date", medicare_coverage_df = medicare_history, coverage_start_variable = "medicare_coverage_start_date", coverage_end_variable = "medicare_coverage_end_date", cache=TRUE)%>% filter(medicare_primary_TF==TRUE)%>% #Move matches for cumulative number of transplants to the top #Note: this step could be used for propensity score matching on various variables #Consider minimizing date difference and age difference as well arrange(desc(cumulative_transplant_total==unmatched_patients$Case$cumulative_transplant_total[[1]])) #Number of controls is minimum of global constant number_controls_per_case and number of distinct USRDS_IDs k <- min(number_controls_per_case, nrow(distinct(eligible_controls, USRDS_ID))) #Initialize sampled_controls df sampled_controls<-eligible_controls[0, ] #Loop to select controls one at a time to ensure same patient is not sampled multiple times while (k>0) { sampled_controls <- bind_rows( sampled_controls, eligible_controls %>%slice_head(n = 1)) eligible_controls<-eligible_controls%>% filter(!(USRDS_ID %in% sampled_controls$USRDS_ID )) k<-min(k-1, nrow(distinct(eligible_controls, USRDS_ID))) } sampled_controls<-sampled_controls%>% mutate(index_date_match = .baseline_control_date)%>% select(-.baseline_control_date) #Add case to matched DF matched_patients[["Case"]]<-unmatched_patients$Case[1,]%>% mutate(matched_control_indices=list(sampled_controls$USRDS_ID))%>% bind_rows(matched_patients[["Case"]]) #Add controls to control DF matched_patients[["Control"]]<-matched_patients[["Control"]]%>% bind_rows(sampled_controls) #Remove controls unmatched_patients[["Control"]]<-unmatched_patients[["Control"]]%>% filter(!(USRDS_ID %in% sampled_controls$USRDS_ID )) #Remove cases unmatched_patients[["Case"]]<-unmatched_patients[["Case"]]%>% slice(-1) if (nrow(unmatched_patients[["Case"]])==0){ continue<-FALSE } if (i==10000){ continue<-FALSE }}post_match_results<-matched_patients%>% bind_rows(.id = "patient_type")```The [modeling](modeling.qmd) script uses the above matched dataframe for the actual cost analysis.## Other portions of the analysis- [**Setup**](setup.qmd): Defines global paths, data sources, cohort inclusion criteria, and analysis-wide constants.- [**Functions**](functions.qmd): Reusable helper functions for cohort construction, matching, costing, and modeling.- [**Create cohort**](create_cohort.qmd): Constructs the initial time-varying cohort of kidney transplant recipients, defining cohort entry, follow-up structure, and case/control eligibility prior to matching.- [**Execute matching**](execute_matching.qmd): Implements risk-set–based greedy matching without replacement to construct the analytic cohort.- [**Post-match processing**](postmatch_processing.qmd): Derives analytic variables, time-aligned cost windows, and follow-up structure after matching.- [**Modeling**](modeling.qmd): Fits prespecified cost and outcome models using the matched cohort.- [**Tables**](tables.qmd): Summary tables and regression outputs generated from the final models.- [**Figures**](figures.qmd):Visualizations of costs, risks, and model-based estimates.- [**About**](about.qmd): methods, assumptions, and disclosures