Load data

Author

Vagish Hemmige

The code in this script loads the data needed for other scripts from multiple sources, as explained in detail below.

Source code

The full R script is available at:

R/load_data.R

This R script file is itself reliant on the following helper files:

Load US state maps from census

We import U.S. state boundary shapefiles using the tigris package.

The tigris::states() function retrieves state-level shapefiles from the U.S. Census Bureau’s TIGER/Line products.

Column Selection

We retain only:

NAME → full state name
STUSPS → two-letter state abbreviation
geometry → polygon boundaries

This reduces memory overhead and keeps the object lightweight for downstream joins and plotting.

Exclusions

We remove:

Alaska and Hawaii (to avoid projection distortion in contiguous U.S. maps)
U.S. territories (not included in SRTR transplant center analyses)

This ensures that visualizations align with the contiguous U.S. study population.

Output

The resulting object, states_sf, is an sf object containing polygon geometries for the 48 contiguous U.S. states,

Click to show/hide R Code

states_sf <- tigris::states(cb = TRUE, year = 2022)%>%
  select(NAME, STUSPS, geometry) %>%
  filter(!(NAME %in% c("Alaska", 
                       "Hawaii", 
                       "Puerto Rico", 
                       "United States Virgin Islands",
                       "Commonwealth of the Northern Mariana Islands",
                       "Guam",
                       "American Samoa"))) %>%
  rename(State=NAME)

Import transplant data from UNOS SRTR

To identify active transplant centers and characterize transplant activity by organ type, we import transplant recipient-level and donor-level data from the Scientific Registry of Transplant Recipients (SRTR).

These datasets are accessed using a custom helper function, load_srtr_file(), which standardizes file loading, variable labeling, and preprocessing across SRTR tables.

Click to show/hide R Code

tx<-list()

tx[["kidney"]] <- load_srtr_file("tx_ki")
tx[["liver"]] <- load_srtr_file("tx_li")
tx[["heart"]] <- load_srtr_file("tx_hr")
tx[["lung"]] <- load_srtr_file("tx_lu")
tx[["pancreas"]] <- load_srtr_file("tx_kp")



donor_deceased <- load_srtr_file("donor_deceased")%>%
  select(DONOR_ID, DON_ANTI_HIV, DON_HIV_NAT)

Load transplant center data from SRTR package

To map transplant center locations and define geographic catchment areas, we retrieve center-level data from the Health Resources and Services Administration (HRSA) using the sRtr package.

The get_hrsa_transplant_centers() function accesses HRSA transplant center metadata, including center identifiers and geographic coordinates.

Subsequent steps:

Remove Hawaii and Puerto Rico
Keep the following variables:
- OTC_NM: transplant center name
- OTC_CD: transplant center 4-letter UNOS code
- Service_Lst: list of organs transplanted at that center
- X, Y: longitude and latitude of transplant center
Keep transplant centers whose organ corresponds with the organ defined in the organ_loop variable for each loop
Removes duplicates with the distinct() command
Renames variables for convenience
Pauses for five seconds between iterations to avoid overloading the HRSA website

Click to show/hide R Code

Transplant_centers_all<-list()

for (organ_loop in organ_list)
{
  
  
  Transplant_centers_all[[organ_loop]]<-sRtr::get_hrsa_transplant_centers()%>%
    filter(OPO_STATE_ABBR != "PR" & OPO_STATE_ABBR!= "HI")%>%
    select(OTC_NM, OTC_CD, Service_Lst, X, Y) %>%
    filter(str_detect(Service_Lst, str_to_sentence(organ_loop)))%>%
    select(-Service_Lst)%>%
    distinct()%>%
    rename(OTCName=OTC_NM, OTCCode=OTC_CD, Latitude=Y, Longitude=X)
  Sys.sleep(5)
  
}

Load census data

To estimate catchment populations for transplant centers, we retrieve tract- and county-level population data from the American Community Survey (ACS) 5-year files. Because these data are large and slow to download, we implement a caching strategy to avoid repeated API calls during iterative renders.

Overview

For each year in year_list, this code:

Retrieves tract-level total population
Retrieves county-level total population
Caches results locally as .rds files

This structure ensures reproducibility while dramatically improving performance during Quarto re-rendering.

Tract-Level Population Data

We obtain tract-level total population using ACS variable:

B01003_001 — Total population

If a cached file already exists (cache/us_tracts_population_YEAR.rds), it is loaded directly. Otherwise:

Data are downloaded using get_acs()
Geometry is included (geometry = TRUE) for spatial joins
County FIPS codes are derived from the first five digits of the tract GEOID
Identifier fields are converted to numeric for efficient joins
Only necessary variables are retained
The cleaned dataset is saved locally for future use

Why include geometry?

Including tract geometry allows direct spatial intersection with:

50-mile buffers
60-minute isochrones
United catchment areas

This avoids needing a second shapefile join later.

County-Level Population Data

County-level total population is retrieved using the same ACS variable (B01003_001), with geometry retained.

County data are used for:

County-level HIV prevalence overlays
Aggregated catchment summaries
Validation checks against tract-level totals

As with tract data, results are cached locally to prevent redundant API calls.

Age and Sex Variable Labels (`B01001`)

The ACS table B01001 contains detailed age- and sex-specific population counts. Rather than downloading these counts directly in this step, we load the variable metadata using:

Click to show/hide R Code

us_tracts_population<-list()
us_counties_population<-list()
vars<-list()

for (year_loop in year_list){
  
  
  tracts_path<-paste0("cache/us_tracts_population_",year_loop,".rds")
  counties_path<-paste0("cache/us_counties_population_",year_loop,".rds")
  
  #Pull US Census tract populations if cache file doesn't already exist
  
  if (file.exists(tracts_path)) {
    us_tracts_population[[year_loop]] <- readRDS(tracts_path)
  } else {
    
    us_tracts_population[[year_loop]]<-get_acs(
      geography = "tract",
      variables = "B01003_001",
      year = as.numeric(year_loop),
      survey = "acs5",
      geometry = TRUE,
      state = continental_fips
    )%>%
      mutate(county_fips = substr(GEOID, 1, 5))%>%
      mutate(GEOID=as.numeric(GEOID))%>%
      mutate(county_fips=as.numeric(county_fips))%>%
      select(-variable, -moe)%>%
      rename(tract_population=estimate)
    
    saveRDS(us_tracts_population[[year_loop]], 
            tracts_path)
    
  }
  
  #Pull US Census counties populations if cache file doesn't already exist
  
  if (file.exists(counties_path)) {
    us_counties_population[[year_loop]] <- readRDS(counties_path)
  } else {
    
    us_counties_population[[year_loop]]<-get_acs(
      geography = "county",
      variables = "B01003_001",
      year = as.numeric(year_loop),
      survey = "acs5",
      geometry = TRUE,
      state = continental_fips
    )%>%
      mutate(GEOID=as.numeric(GEOID))%>%
      select(-variable, -moe)%>%
      rename(county_population=estimate)
    
    saveRDS(us_counties_population[[year_loop]], 
            counties_path)
    
    
  }
  
  # Load tract variable labels for age/sex values B01001 from the ACS 5-year 2017 dataset
  vars[[year_loop]] <- load_variables(year_loop, "acs5", cache = TRUE) %>%
    filter(str_detect(name, "B01001_"))%>%
    mutate(label = str_replace_all(label, "Estimate!!Total!!", ""))%>%
    mutate(label = str_replace_all(label, "!!", "_"))%>%
    mutate(label = str_replace_all(label, " ", "_"))
  
}

AtlasPlus county-level HIV data

This code uses the CDCAtlas package to download county-level HIV prevalence data for the years defined in year_list.

Click to show/hide R Code

AtlasPlusTableData_county_totals<-list()

for (year_loop in year_list){
  
  AtlasPlusTableData_county_totals[[year_loop]] <-CDCAtlas::get_atlas(disease="hiv",
                                                                      geography="county",
                                                                      year=as.numeric(year_loop))%>%
    clean_names()%>%
    rename(county_cases=cases)%>%
    rename(geo_id=fips)%>%
    rename(county_name=geography)%>%
    mutate(county_cases=if_else(is.na(county_cases),0, county_cases))%>%
    mutate(geo_id=as.numeric(geo_id))%>%
    filter(year==as.numeric(year_loop))%>%
    select(-year)%>%
    filter(indicator=="HIV prevalence")
  
}

--- title: "Load data" format: html --- The code in this script loads the data needed for other scripts from multiple sources, as explained in detail below. ::: {.Rcode title="Source code"} The full R script is available at: - [`R/load_data.R`](https://github.com/VagishHemmige/HOPE-CDC-analysis-2026/blob/master/R/load_data.R) This R script file is itself reliant on the following helper files: - [`R/setup.R`](https://github.com/VagishHemmige/HOPE-CDC-analysis-2026/blob/master/R/setup.R) - [`R/functions.R`](https://github.com/VagishHemmige/HOPE-CDC-analysis-2026/blob/master/R/functions.R) ::: ## Load US state maps from census We import U.S. state boundary shapefiles using the `tigris` package. The `tigris::states()` function retrieves state-level shapefiles from the U.S. Census Bureau’s TIGER/Line products. ### Column Selection We retain only: - `NAME` → full state name - `STUSPS` → two-letter state abbreviation - `geometry` → polygon boundaries This reduces memory overhead and keeps the object lightweight for downstream joins and plotting. ### Exclusions We remove: - Alaska and Hawaii (to avoid projection distortion in contiguous U.S. maps) - U.S. territories (not included in SRTR transplant center analyses) This ensures that visualizations align with the contiguous U.S. study population. ### Output The resulting object, `states_sf`, is an `sf` object containing polygon geometries for the 48 contiguous U.S. states, ```{r, eval=FALSE} states_sf <- tigris::states(cb = TRUE, year = 2022)%>% select(NAME, STUSPS, geometry) %>% filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico", "United States Virgin Islands", "Commonwealth of the Northern Mariana Islands", "Guam", "American Samoa"))) %>% rename(State=NAME) ``` ## Import transplant data from UNOS SRTR To identify active transplant centers and characterize transplant activity by organ type, we import transplant recipient-level and donor-level data from the Scientific Registry of Transplant Recipients (SRTR). These datasets are accessed using a custom helper function, `load_srtr_file()`, which standardizes file loading, variable labeling, and preprocessing across SRTR tables. ```{r, eval=FALSE} tx<-list() tx[["kidney"]] <- load_srtr_file("tx_ki") tx[["liver"]] <- load_srtr_file("tx_li") tx[["heart"]] <- load_srtr_file("tx_hr") tx[["lung"]] <- load_srtr_file("tx_lu") tx[["pancreas"]] <- load_srtr_file("tx_kp") donor_deceased <- load_srtr_file("donor_deceased")%>% select(DONOR_ID, DON_ANTI_HIV, DON_HIV_NAT) ``` ## Load transplant center data from SRTR package To map transplant center locations and define geographic catchment areas, we retrieve center-level data from the Health Resources and Services Administration (HRSA) using the `sRtr` package. The `get_hrsa_transplant_centers()` function accesses HRSA transplant center metadata, including center identifiers and geographic coordinates. Subsequent steps: - Remove Hawaii and Puerto Rico - Keep the following variables: - `OTC_NM`: transplant center name - `OTC_CD`: transplant center 4-letter UNOS code - `Service_Lst`: list of organs transplanted at that center - `X`, `Y`: longitude and latitude of transplant center - Keep transplant centers whose organ corresponds with the organ defined in the `organ_loop` variable for each loop - Removes duplicates with the `distinct()` command - Renames variables for convenience - Pauses for five seconds between iterations to avoid overloading the HRSA website ```{r, eval=FALSE} Transplant_centers_all<-list() for (organ_loop in organ_list) { Transplant_centers_all[[organ_loop]]<-sRtr::get_hrsa_transplant_centers()%>% filter(OPO_STATE_ABBR != "PR" & OPO_STATE_ABBR!= "HI")%>% select(OTC_NM, OTC_CD, Service_Lst, X, Y) %>% filter(str_detect(Service_Lst, str_to_sentence(organ_loop)))%>% select(-Service_Lst)%>% distinct()%>% rename(OTCName=OTC_NM, OTCCode=OTC_CD, Latitude=Y, Longitude=X) Sys.sleep(5) } ``` ## Load census data To estimate catchment populations for transplant centers, we retrieve tract- and county-level population data from the American Community Survey (ACS) 5-year files. Because these data are large and slow to download, we implement a caching strategy to avoid repeated API calls during iterative renders. ### Overview For each year in `year_list`, this code: - Retrieves tract-level total population - Retrieves county-level total population - Caches results locally as `.rds` files This structure ensures reproducibility while dramatically improving performance during Quarto re-rendering. ### Tract-Level Population Data We obtain tract-level total population using ACS variable: - `B01003_001` — Total population If a cached file already exists (`cache/us_tracts_population_YEAR.rds`), it is loaded directly. Otherwise: - Data are downloaded using `get_acs()` - Geometry is included (`geometry = TRUE`) for spatial joins - County FIPS codes are derived from the first five digits of the tract `GEOID` - Identifier fields are converted to numeric for efficient joins - Only necessary variables are retained - The cleaned dataset is saved locally for future use #### Why include geometry? Including tract geometry allows direct spatial intersection with: - 50-mile buffers - 60-minute isochrones - United catchment areas This avoids needing a second shapefile join later. ### County-Level Population Data County-level total population is retrieved using the same ACS variable (`B01003_001`), with geometry retained. County data are used for: - County-level HIV prevalence overlays - Aggregated catchment summaries - Validation checks against tract-level totals As with tract data, results are cached locally to prevent redundant API calls. ### Age and Sex Variable Labels (`B01001`) The ACS table `B01001` contains detailed age- and sex-specific population counts. Rather than downloading these counts directly in this step, we load the variable metadata using: ```{r, eval=FALSE} us_tracts_population<-list() us_counties_population<-list() vars<-list() for (year_loop in year_list){ tracts_path<-paste0("cache/us_tracts_population_",year_loop,".rds") counties_path<-paste0("cache/us_counties_population_",year_loop,".rds") #Pull US Census tract populations if cache file doesn't already exist if (file.exists(tracts_path)) { us_tracts_population[[year_loop]] <- readRDS(tracts_path) } else { us_tracts_population[[year_loop]]<-get_acs( geography = "tract", variables = "B01003_001", year = as.numeric(year_loop), survey = "acs5", geometry = TRUE, state = continental_fips )%>% mutate(county_fips = substr(GEOID, 1, 5))%>% mutate(GEOID=as.numeric(GEOID))%>% mutate(county_fips=as.numeric(county_fips))%>% select(-variable, -moe)%>% rename(tract_population=estimate) saveRDS(us_tracts_population[[year_loop]], tracts_path) } #Pull US Census counties populations if cache file doesn't already exist if (file.exists(counties_path)) { us_counties_population[[year_loop]] <- readRDS(counties_path) } else { us_counties_population[[year_loop]]<-get_acs( geography = "county", variables = "B01003_001", year = as.numeric(year_loop), survey = "acs5", geometry = TRUE, state = continental_fips )%>% mutate(GEOID=as.numeric(GEOID))%>% select(-variable, -moe)%>% rename(county_population=estimate) saveRDS(us_counties_population[[year_loop]], counties_path) } # Load tract variable labels for age/sex values B01001 from the ACS 5-year 2017 dataset vars[[year_loop]] <- load_variables(year_loop, "acs5", cache = TRUE) %>% filter(str_detect(name, "B01001_"))%>% mutate(label = str_replace_all(label, "Estimate!!Total!!", ""))%>% mutate(label = str_replace_all(label, "!!", "_"))%>% mutate(label = str_replace_all(label, " ", "_")) } ``` ## AtlasPlus county-level HIV data This code uses the `CDCAtlas` package to download county-level HIV prevalence data for the years defined in `year_list`. ```{r eval=FALSE} AtlasPlusTableData_county_totals<-list() for (year_loop in year_list){ AtlasPlusTableData_county_totals[[year_loop]] <-CDCAtlas::get_atlas(disease="hiv", geography="county", year=as.numeric(year_loop))%>% clean_names()%>% rename(county_cases=cases)%>% rename(geo_id=fips)%>% rename(county_name=geography)%>% mutate(county_cases=if_else(is.na(county_cases),0, county_cases))%>% mutate(geo_id=as.numeric(geo_id))%>% filter(year==as.numeric(year_loop))%>% select(-year)%>% filter(indicator=="HIV prevalence") } ```

Load US state maps from census

Column Selection

Exclusions

Output

Import transplant data from UNOS SRTR

Load transplant center data from SRTR package

Load census data

Overview

Tract-Level Population Data

Why include geometry?

County-Level Population Data

Age and Sex Variable Labels (B01001)

AtlasPlus county-level HIV data

Age and Sex Variable Labels (`B01001`)