Modeling

Author: Vagish Hemmige

Affiliation: Montefiore Medical Center / Albert Einstein College of Medicine

This script fits the statistical models that the R/figures.R and R/tables.R scripts use to generate the tables and figures for the manuscript.

Source code

The full R script is available at:

R/modeling.R

This R script relies on the following helper files:

R/setup.R
R/functions.R

Longitudinal cost data models

The code below fits models to the following monthly cost outcomes, which are generated by R/postmatch_processing.R:

Total costs
Physician/supplier costs
Institutional total claims costs
Institutional Home Health costs
Institutional Inpatient costs
Institutional Outpatient costs
Institutional Skilled Nursing Facility costs
Institutional Dialysis costs
Institutional Auxiliary costs
Institutional Hospice costs

Analyzing cost data is notoriously challenging. While guides exist online, experts have also devoted entire books to the topic.

To account for repeated observations within patients, we explored both generalized estimating equations and mixed-effects models with patient as a random effect.
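As a sketch of the distinction, using notation that is ours rather than the manuscript's: let $Y_{ij}$ be the (possibly log-transformed) cost for patient $i$ in month $j$, $x_{ij}$ the covariates (patient type, month, and their interaction), $o_{ij}$ the value of `month_offset`, and $g$ the link function. The random-intercept model conditions on a patient-specific effect $b_i$, whereas the GEE models the marginal mean and handles within-patient correlation through a working correlation structure (exchangeable in the code on this page):

$$
\begin{aligned}
\text{Mixed model:}\quad & g\left(E[Y_{ij} \mid b_i]\right) = x_{ij}^\top \beta + \log o_{ij} + b_i, \qquad b_i \sim N(0, \sigma_b^2) \\
\text{GEE:}\quad & g\left(E[Y_{ij}]\right) = x_{ij}^\top \beta^{*} + \log o_{ij}, \qquad \operatorname{Corr}(Y_{ij}, Y_{ik}) = \alpha \ \text{ for } j \neq k
\end{aligned}
$$

Here $\beta$ is interpreted conditional on the patient effect and $\beta^{*}$ as population-averaged; under an identity link the two coincide.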

Medical cost data are highly skewed, which makes modeling difficult. We explored linear models, models of log-transformed cost (log(1 + cost)), and Tweedie models.
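One reason the choice of scale matters is that the mean of log-transformed costs does not back-transform to the mean cost. The following is a minimal sketch on simulated data; the distribution and its parameters are invented for illustration and are not taken from the study data.

```r
# Simulated, purely illustrative monthly costs:
# about 30% zero-cost months plus a right-skewed positive part
set.seed(1)
n <- 10000
cost <- ifelse(runif(n) < 0.3, 0, rlnorm(n, meanlog = 7, sdlog = 1.5))

# Arithmetic mean on the dollar scale
arithmetic_mean <- mean(cost)

# Naive back-transformation of the mean of log1p(cost)
backtransformed_mean <- expm1(mean(log1p(cost)))

# Skewness shows up as a mean well above the median
summary_values <- c(
  mean            = arithmetic_mean,
  median          = median(cost),
  backtransformed = backtransformed_mean
)
print(round(summary_values, 1))
```

Because log1p is concave, the back-transformed value can never exceed the arithmetic mean, so estimates from log-scale models need care when they are reported in dollars.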

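Accounting for missingness and censoring of costs due to lack of coverage and death is itself a challenging topic, covered well in a recent review (https://pmc.ncbi.nlm.nih.gov/articles/PMC3377439/). Seminal papers have been written by:

Lin et al (https://dlin.web.unc.edu/wp-content/uploads/sites/1568/2013/04/LinEA97.pdf)
Bang et al (https://www.jstor.org/stable/2673467)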

We considered several approaches:

Naive estimation (patients contribute no observations for months in which they are uncovered or dead)
Treating post-death costs as zero and ignoring uncovered months (not yet implemented)
Treating post-death costs as zero and weighting to adjust for uncovered months (not yet implemented)
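The second approach in the list above could be sketched roughly as follows. This is not the project's implementation (which is not yet written); the data frames, column names, and follow-up length are all hypothetical, and only base R is used.

```r
# Hypothetical toy data: one row per observed (covered) patient-month
observed <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  month = c(1, 2, 3, 1, 2),
  cost  = c(100, 250, 80, 40, 900)
)

# Hypothetical month of death; NA means alive throughout follow-up
deaths <- data.frame(id = c(1, 2), death_month = c(NA, 2))

# Assumed length of follow-up, in months
n_months <- 4

# Build the full patient-by-month grid and attach the observed costs;
# months without an observed row get cost = NA
grid <- expand.grid(id = unique(observed$id), month = seq_len(n_months))
full <- merge(grid, observed, by = c("id", "month"), all.x = TRUE)
full <- merge(full, deaths, by = "id", all.x = TRUE)

# Months after death are assigned zero cost
after_death <- !is.na(full$death_month) & full$month > full$death_month
full$cost[after_death] <- 0

# Months that are uncovered while the patient is alive stay NA,
# so a model fit with na.omit would ignore them
full <- full[order(full$id, full$month), ]
print(full)
```

In this toy example patient 2 dies in month 2, so months 3 and 4 become zero-cost rows, while patient 1's unobserved month 4 remains missing.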


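# Statistical modeling of cost data

library(glmmTMB)  # glmmTMB(), tweedie()
library(geepack)  # geeglm()

# Containers for the formulas and fitted models; created only if not already defined upstream
if (!exists("temp_formula")) temp_formula <- list()
if (!exists("fit")) fit <- list()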

for (outcome in c("grand_total_cost_month", "IN_CLM_month_total", "PS_REV_month_total", "IN_CLM_month_groupedHomeHealth",
                  "IN_CLM_month_groupedHospice","IN_CLM_month_groupedNonclaimauxiliary","IN_CLM_month_groupedDialysis",
                  "IN_CLM_month_groupedOutpatient","IN_CLM_month_groupedInpatient","IN_CLM_month_groupedSkilledNursingFacility")){

  temp_formula[["glmmTMB"]][["untransformed"]]<-as.formula(
    paste0(
      outcome, " ~ ",
      "patient_type * factor(month) + ",
      "offset(log(month_offset)) + ",
      "(1 | USRDS_ID)"
    )
  )
  temp_formula[["gee"]][["untransformed"]]<-as.formula(
    paste0(
      outcome, " ~ ",
      "patient_type * factor(month) + ",
      "offset(log(month_offset))"
    )
  )

  temp_formula[["glmmTMB"]][["logtransformed"]] <- as.formula(
    paste0(
      "log1p(", outcome, ") ~ ",
      "patient_type * factor(month) + ",
      "offset(log(month_offset)) + ",
      "(1 | USRDS_ID)"
    )
  )
  temp_formula[["gee"]][["logtransformed"]]<-as.formula(
    paste0(
      "log1p(", outcome, ") ~ ",
      "patient_type * factor(month) + ",
      "offset(log(month_offset))"
    )
  )
  
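  # deparse() the formula first; message() would otherwise paste its components together
  message(
    "Fitting models to formula: ",
    paste(deparse(temp_formula[["glmmTMB"]][["untransformed"]], width.cutoff = 500L), collapse = " ")
  )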
  
  message("Fitting glmmTMB to tweedie family...")
  fit[[outcome]][["glmmTMB"]][["tweedie"]] <- glmmTMB(
    temp_formula[["glmmTMB"]][["untransformed"]],
    family = tweedie(link = "log"),
    data = final_data_set
  )
  
  message("Fitting glmmTMB to linear family...")
  fit[[outcome]][["glmmTMB"]][["linear"]] <- glmmTMB(
    temp_formula[["glmmTMB"]][["untransformed"]],
    family = gaussian(link = "identity"),
    data = final_data_set
  )
  
  # geeglm() assumes that the rows for each USRDS_ID are contiguous in the data
  message("Fitting gee to linear family...")
  fit[[outcome]][["gee"]][["linear"]] <- geeglm(
    temp_formula[["gee"]][["untransformed"]],
    id     = USRDS_ID,
    data   = final_data_set,
    corstr = "exchangeable",
    family = gaussian(link = "identity"))  
  
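  # deparse() the formula first; message() would otherwise paste its components together
  message(
    "Fitting models to formula: ",
    paste(deparse(temp_formula[["glmmTMB"]][["logtransformed"]], width.cutoff = 500L), collapse = " ")
  )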
  
  message("Fitting glmmTMB to logcost family...")
  fit[[outcome]][["glmmTMB"]][["log"]] <- glmmTMB(
    temp_formula[["glmmTMB"]][["logtransformed"]],
    family = gaussian(link = "identity"),
    data = final_data_set
  )
  
  message("Fitting gee to logcost family...")
  fit[[outcome]][["gee"]][["log"]] <- geeglm(
    temp_formula[["gee"]][["logtransformed"]],
    id     = USRDS_ID,
    data   = final_data_set,
    corstr = "exchangeable",
    family = gaussian(link = "identity"))  

  
}
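The formula-building step used in the loop can be tried in isolation, without fitting any model. The sketch below uses base R only; the helper name `build_cost_formula` is hypothetical and does not appear in the project's scripts.

```r
# Hypothetical helper: build the mixed-model formula for one outcome,
# optionally wrapping the outcome in log1p()
build_cost_formula <- function(outcome, log_transform = FALSE) {
  lhs <- if (log_transform) paste0("log1p(", outcome, ")") else outcome
  as.formula(
    paste0(
      lhs, " ~ ",
      "patient_type * factor(month) + ",
      "offset(log(month_offset)) + ",
      "(1 | USRDS_ID)"
    )
  )
}

f_raw <- build_cost_formula("grand_total_cost_month")
f_log <- build_cost_formula("grand_total_cost_month", log_transform = TRUE)

# Convert each formula to a single string suitable for message()
label_raw <- paste(deparse(f_raw, width.cutoff = 500L), collapse = " ")
label_log <- paste(deparse(f_log, width.cutoff = 500L), collapse = " ")
message("Untransformed:   ", label_raw)
message("Log-transformed: ", label_log)

# Variables the formula expects to find in the data
vars_needed <- all.vars(f_log)
print(vars_needed)
```

Checking `all.vars()` against `names(final_data_set)` before the loop is one way to catch a misspelled outcome name before any model is fitted.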

Other portions of the analysis

Setup: Defines global paths, data sources, cohort inclusion criteria, and analysis-wide constants.
Functions: Reusable helper functions for cohort construction, matching, costing, and modeling.
Create cohort: Constructs the initial time-varying cohort of kidney transplant recipients, defining cohort entry, follow-up structure, and case/control eligibility prior to matching.
Execute matching: Implements risk-set–based greedy matching without replacement to construct the analytic cohort.
Post-match processing: Derives analytic variables, time-aligned cost windows, and follow-up structure after matching.
Modeling: Fits prespecified cost and outcome models using the matched cohort.
Tables: Summary tables and regression outputs generated from the final models.
Figures: Visualizations of costs, risks, and model-based estimates.
About: Methods, assumptions, and disclosures.
