Extrapolation to tracts

library(CDCAtlas)
library(dplyr)
library(ggplot2)

Overview

CDC AtlasPlus reports many outcomes at the county level. For some analyses, especially geographic access analyses, we may want estimates at a smaller geographic unit such as the Census tract.

CDCAtlas includes helper functions for extrapolating county-level AtlasPlus counts to Census tracts using tract-level Census denominators.

The basic idea is:

Retrieve county-level CDC AtlasPlus counts.
Retrieve tract-level Census population denominators.
Match each tract to its parent county.
Allocate each county’s reported cases across tracts in proportion to the tract’s share of the relevant county population.

For example, if a tract contains 2% of a county’s population in the relevant denominator, it receives 2% of the county’s reported cases.
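
As a minimal sketch of this arithmetic, the snippet below allocates one county's cases across three tracts. All numbers are made up for illustration; they are not AtlasPlus or Census values.

```r
# Illustrative values only: one county with 500 reported cases and three tracts.
county_cases <- 500
tract_population <- c(tract_a = 1000, tract_b = 9000, tract_c = 40000)

# Each tract's share of the county denominator.
tract_share <- tract_population / sum(tract_population)

# Allocate county cases in proportion to that share.
tract_cases <- county_cases * tract_share

round(tract_share, 2)
tract_cases
```

Here tract_a holds 2% of the county population and therefore receives 10 of the 500 cases, and the three tract estimates sum back to 500.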

This is a population-weighted extrapolation, not a direct CDC tract-level surveillance estimate.

Census data requirements

Tract-level extrapolation requires tract-level population denominators from the U.S. Census Bureau. CDCAtlas uses the tidycensus package to retrieve these denominators.

If you only use CDCAtlas to retrieve AtlasPlus county, state, or national data, you do not need to set up tidycensus. However, if you use extrapolate_to_tract = TRUE, you will need:

the tidycensus package installed
a Census API key
the API key saved in your R environment

You can install tidycensus from CRAN:

install.packages("tidycensus")

Then request a free Census API key from the U.S. Census Bureau and save it to your R environment with tidycensus::census_api_key(), for example by passing install = TRUE so that the key persists across sessions.

See https://walker-data.com/tidycensus/ for further details.
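
Before requesting tract-level output, it can help to confirm that both requirements are in place. The sketch below is one way to do that; census_ready() is a hypothetical helper written for this example and is not part of CDCAtlas. It relies on the fact that tidycensus stores the key in the CENSUS_API_KEY environment variable.

```r
# Hypothetical preflight helper (not part of CDCAtlas or tidycensus).
census_ready <- function() {
  # Is the tidycensus package installed?
  has_pkg <- requireNamespace("tidycensus", quietly = TRUE)
  # Is a Census API key available in the current R environment?
  has_key <- nzchar(Sys.getenv("CENSUS_API_KEY"))

  if (!has_pkg) {
    message("tidycensus is not installed; run install.packages(\"tidycensus\").")
  }
  if (!has_key) {
    message("No CENSUS_API_KEY found; run ",
            "tidycensus::census_api_key(\"YOUR_KEY\", install = TRUE) ",
            "and restart R.")
  }
  has_pkg && has_key
}

census_ready()
```

Replace "YOUR_KEY" with the key issued to you by the Census Bureau.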

Important assumptions

This method assumes that, within each county and stratum, cases are distributed across tracts in proportion to the relevant Census denominator.

For example:

unstratified HIV prevalence may be allocated using total tract population
sex-stratified counts may be allocated using tract population by sex
age-stratified counts may be allocated using tract population by age
race/ethnicity-stratified counts may be allocated using tract population by race/ethnicity

This can be useful for geographic access models, but it should not be interpreted as observed tract-level surveillance data.

Basic unstratified extrapolation

The simplest use case is extrapolating county-level data to tracts using total tract population.

hiv_tract <- get_atlas(
  disease = "hiv",
  year = 2022,
  geography = "county",
  extrapolate_to_tract = TRUE
)

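The returned data contain one row per tract for each indicator returned, so a tract can appear more than once in the output below. In each row, the county-level AtlasPlus count has been allocated to the tract.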

head(hiv_tract)
#> # A tibble: 6 × 26
#>    year tract_fips  tract_name            county_fips race_ethnicity sex   age  
#>   <dbl> <chr>       <chr>                 <chr>       <chr>          <chr> <chr>
#> 1  2022 01001020100 Census Tract 201; Au… 01001       All races/eth… Both… Ages…
#> 2  2022 01001020100 Census Tract 201; Au… 01001       All races/eth… Both… Ages…
#> 3  2022 01001020200 Census Tract 202; Au… 01001       All races/eth… Both… Ages…
#> 4  2022 01001020200 Census Tract 202; Au… 01001       All races/eth… Both… Ages…
#> 5  2022 01001020300 Census Tract 203; Au… 01001       All races/eth… Both… Ages…
#> 6  2022 01001020300 Census Tract 203; Au… 01001       All races/eth… Both… Ages…
#> # ℹ 19 more variables: tract_population_acs <dbl>, county_population_acs <dbl>,
#> #   indicator <fct>, county_name <fct>, data_status <fct>, transmission <fct>,
#> #   rate100000 <dbl>, county_population_atlas <dbl>, lowerci_rate <dbl>,
#> #   upperci_rate <dbl>, rse <dbl>, lowerci_cases <dbl>, upperci_cases <dbl>,
#> #   state_fips <chr>, state_name <fct>, state_cases <dbl>, county_cases <dbl>,
#> #   tract_cases <dbl>, tract_noncases <dbl>
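
Each tract appears twice in the output above, which appears to reflect one row per indicator. The sketch below uses a small made-up table with the same column names to show how you might confirm that tract_fips and indicator together identify a row, and how to keep a single indicator. The values are illustrative only.

```r
library(dplyr)

# Toy stand-in for hiv_tract; values are made up.
toy <- tibble(
  tract_fips  = rep(c("01001020100", "01001020200"), each = 2),
  indicator   = rep(c("HIV diagnoses", "HIV prevalence"), times = 2),
  tract_cases = c(0.4, 6.1, 0.5, 7.3)
)

# Any rows returned here would indicate duplicate tract-indicator pairs.
duplicates <- toy %>%
  count(tract_fips, indicator) %>%
  filter(n > 1)
duplicates

# Keep a single indicator to get one row per tract.
prevalence <- toy %>%
  filter(indicator == "HIV prevalence")
nrow(prevalence) == n_distinct(prevalence$tract_fips)
```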

A typical output includes columns such as:

names(hiv_tract)
#>  [1] "year"                    "tract_fips"             
#>  [3] "tract_name"              "county_fips"            
#>  [5] "race_ethnicity"          "sex"                    
#>  [7] "age"                     "tract_population_acs"   
#>  [9] "county_population_acs"   "indicator"              
#> [11] "county_name"             "data_status"            
#> [13] "transmission"            "rate100000"             
#> [15] "county_population_atlas" "lowerci_rate"           
#> [17] "upperci_rate"            "rse"                    
#> [19] "lowerci_cases"           "upperci_cases"          
#> [21] "state_fips"              "state_name"             
#> [23] "state_cases"             "county_cases"           
#> [25] "tract_cases"             "tract_noncases"

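The columns identify:

the tract (tract_fips, tract_name)
the parent county (county_fips, county_name)
the AtlasPlus indicator and year (indicator, year)
the county-level cases (county_cases)
the tract denominator (tract_population_acs)
the county denominator (county_population_acs)
the extrapolated tract cases (tract_cases)

The tract share of the county denominator is not stored as its own column in the output shown above; it can be computed as tract_population_acs / county_population_acs.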

The key calculation is:

tract_cases = county_cases * tract_population / county_population

or, more generally:

tract_cases = county_cases * tract_denominator / county_denominator
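
The calculation can be sketched with dplyr on a small made-up table. The column names county_cases, tract_population_acs, county_population_acs, and tract_cases follow the output above; tract_share is a hypothetical name introduced here. The sketch also assumes that the county denominator is the sum of its tract denominators, which may not match how CDCAtlas derives it.

```r
library(dplyr)

# Toy data; values are made up for illustration.
toy <- tibble(
  county_fips = c("00001", "00001", "00001", "00002", "00002"),
  tract_fips  = c("00001000100", "00001000200", "00001000300",
                  "00002000100", "00002000200"),
  county_cases = c(120, 120, 120, 30, 30),
  tract_population_acs = c(2000, 3000, 5000, 1500, 4500)
)

toy_alloc <- toy %>%
  group_by(county_fips) %>%
  mutate(
    # Assumption: county denominator = sum of tract denominators.
    county_population_acs = sum(tract_population_acs),
    # Hypothetical helper column: tract share of the county denominator.
    tract_share = tract_population_acs / county_population_acs,
    tract_cases = county_cases * tract_share
  ) %>%
  ungroup()

toy_alloc %>%
  select(tract_fips, tract_share, tract_cases)
```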

Checking county totals

After extrapolation, tract-level estimates should sum back to the original county-level counts, allowing for small floating-point differences.

hiv_tract %>%
  group_by(county_fips, indicator) %>%
  summarize(
    county_cases = first(county_cases),
    tract_cases_sum = sum(tract_cases, na.rm = TRUE),
    difference = tract_cases_sum - county_cases,
    .groups = "drop"
  ) %>%
  arrange(desc(abs(difference))) %>%
  head()
#> # A tibble: 6 × 5
#>   county_fips indicator      county_cases tract_cases_sum difference
#>   <chr>       <fct>                 <dbl>           <dbl>      <dbl>
#> 1 51678       HIV prevalence        559.            559.    1.14e-13
#> 2 51530       HIV prevalence        462.            462.   -5.68e-14
#> 3 17187       HIV prevalence        126.            126.   -1.42e-14
#> 4 22107       HIV prevalence        113.            113.    1.42e-14
#> 5 16021       HIV prevalence         44.4            44.4   7.11e-15
#> 6 17047       HIV prevalence         46.7            46.7   7.11e-15

In a clean allocation, the difference column should be very close to zero.
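
Rather than reading the differences by eye, you could flag counties whose difference exceeds a tolerance. The sketch below does this on a made-up summary table shaped like the one above; the tolerance value is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy summary table; values are made up.
check <- tibble(
  county_fips     = c("00001", "00002", "00003"),
  county_cases    = c(120, 30, 0.04),
  tract_cases_sum = c(120 + 1e-13, 30, 0)
)

# Hypothetical tolerance; choose one appropriate for your analysis.
tol <- 1e-8

flagged <- check %>%
  mutate(difference = tract_cases_sum - county_cases) %>%
  filter(abs(difference) > tol)

flagged
```

Only the third county is flagged: its difference is far larger than floating-point noise, while the first county's difference falls well inside the tolerance.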

Sex-stratified extrapolation

For sex-stratified AtlasPlus data, the allocation denominator should also be sex-specific.

hiv_tract_sex <- get_atlas(
  disease = "hiv",
  year = 2022,
  geography = "county",
  stratify_by = "sex",
  extrapolate_to_tract = TRUE
)

This means that male county cases are allocated across tracts using the male population in each tract, and female county cases are allocated using the female population in each tract.

Conceptually:

\[ \widehat{\text{tract cases}}_{\text{male}} = \text{county cases}_{\text{male}} \times \frac{\text{tract population}_{\text{male}}} {\text{county population}_{\text{male}}} \]

\[ \widehat{\text{tract cases}}_{\text{female}} = \text{county cases}_{\text{female}} \times \frac{\text{tract population}_{\text{female}}} {\text{county population}_{\text{female}}} \]

The resulting data should include one row per tract per sex stratum for each indicator.

hiv_tract_sex %>%
  count(sex, indicator)
#> # A tibble: 4 × 3
#>   sex    indicator          n
#>   <chr>  <fct>          <int>
#> 1 Female HIV diagnoses  85396
#> 2 Female HIV prevalence 85396
#> 3 Male   HIV diagnoses  85396
#> 4 Male   HIV prevalence 85396
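
The sex-specific formulas above can be sketched on a toy table by grouping on both county and sex before computing the share. The values are made up, and the sketch assumes the stratum-specific county denominator is the sum of the tract denominators in that stratum.

```r
library(dplyr)

# Toy data: one county, two tracts, two sex strata; values are made up.
toy_sex <- tibble(
  county_fips = "00001",
  tract_fips  = rep(c("00001000100", "00001000200"), times = 2),
  sex         = rep(c("Male", "Female"), each = 2),
  county_cases = rep(c(80, 20), each = 2),
  tract_population_acs = c(1000, 3000, 2500, 2500)
)

toy_sex_alloc <- toy_sex %>%
  # The denominator is computed within each county-sex stratum.
  group_by(county_fips, sex) %>%
  mutate(
    tract_cases = county_cases * tract_population_acs /
      sum(tract_population_acs)
  ) %>%
  ungroup()

toy_sex_alloc
```

The 80 male cases are split 20 and 60 because the second tract has three times the male population of the first, while the 20 female cases are split evenly.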

You can verify the allocation within each county and sex stratum.

hiv_tract_sex %>%
  group_by(county_fips, sex, indicator) %>%
  summarize(
    county_cases = first(county_cases),
    tract_cases_sum = sum(tract_cases, na.rm = TRUE),
    difference = tract_cases_sum - county_cases,
    .groups = "drop"
  ) %>%
  arrange(desc(abs(difference))) %>%
  head()
#> # A tibble: 6 × 6
#>   county_fips sex    indicator      county_cases tract_cases_sum difference
#>   <chr>       <chr>  <fct>                 <dbl>           <dbl>      <dbl>
#> 1 37199       Male   HIV prevalence         95.2            95.2   1.42e-14
#> 2 51007       Male   HIV prevalence         50.7            50.7   7.11e-15
#> 3 51735       Male   HIV prevalence         48.0            48.0   7.11e-15
#> 4 72049       Male   HIV prevalence         63.0            63.0   7.11e-15
#> 5 13043       Female HIV prevalence         27.1            27.1  -3.55e-15
#> 6 32021       Male   HIV prevalence         22.8            22.8   3.55e-15

Age-stratified extrapolation

Age-stratified extrapolation works the same way, except that the Census denominator is based on tract population within the corresponding AtlasPlus age group.

hiv_tract_age <- get_atlas(
  disease = "hiv",
  year = 2022,
  geography = "county",
  stratify_by = "age",
  extrapolate_to_tract = TRUE
)

The output contains one row per tract per age stratum for each indicator.

hiv_tract_age %>%
  count(age, indicator)
#> # A tibble: 12 × 3
#>    age   indicator          n
#>    <chr> <fct>          <int>
#>  1 13-24 HIV diagnoses  85396
#>  2 13-24 HIV prevalence 85396
#>  3 25-34 HIV diagnoses  85396
#>  4 25-34 HIV prevalence 85396
#>  5 35-44 HIV diagnoses  85396
#>  6 35-44 HIV prevalence 85396
#>  7 45-54 HIV diagnoses  85396
#>  8 45-54 HIV prevalence 85396
#>  9 55-64 HIV diagnoses  85396
#> 10 55-64 HIV prevalence 85396
#> 11 65+   HIV diagnoses  85396
#> 12 65+   HIV prevalence 85396

Again, the diagnostic check is whether tract estimates sum back to the original county count within each county-age stratum.

hiv_tract_age %>%
  group_by(county_fips, age, indicator) %>%
  summarize(
    county_cases = first(county_cases),
    tract_cases_sum = sum(tract_cases, na.rm = TRUE),
    difference = tract_cases_sum - county_cases,
    .groups = "drop"
  ) %>%
  arrange(desc(abs(difference))) %>%
  head()
#> # A tibble: 6 × 6
#>   county_fips age   indicator      county_cases tract_cases_sum difference
#>   <chr>       <chr> <fct>                 <dbl>           <dbl>      <dbl>
#> 1 48301       35-44 HIV prevalence     0.0352                 0  -0.0352  
#> 2 48301       45-54 HIV prevalence     0.0277                 0  -0.0277  
#> 3 48301       35-44 HIV diagnoses      0.00149                0  -0.00149 
#> 4 48301       45-54 HIV diagnoses      0.000725               0  -0.000725
#> 5 15005       13-24 HIV prevalence     0.000705               0  -0.000705
#> 6 15005       13-24 HIV diagnoses      0.000162               0  -0.000162
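
In the output above, the largest differences are not floating-point noise: the tract sums are exactly 0 for a few county-age strata with very small county counts. One possible cause, offered here as an assumption rather than a description of CDCAtlas internals, is a stratum in which every tract denominator is zero or missing. The sketch below shows how that situation produces a tract sum of 0 when na.rm = TRUE is used.

```r
# Toy county-age stratum: a small county count, but no tract residents
# in this age group. Values are made up for illustration.
county_cases <- 0.0352
tract_population <- c(0, 0)

# 0 / 0 is NaN, so every tract estimate is NaN.
tract_cases <- county_cases * tract_population / sum(tract_population)
tract_cases

# With na.rm = TRUE the NaN values are dropped, so the sum is 0.
tract_cases_sum <- sum(tract_cases, na.rm = TRUE)
tract_cases_sum - county_cases
```

If this is what is happening, the affected county counts are not allocated to any tract, so it is worth inspecting those strata before using the tract estimates.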

Race/ethnicity-stratified extrapolation

CDC AtlasPlus commonly uses a combined race/ethnicity variable. For example, Hispanic/Latino is represented as a single ethnicity category, while the remaining groups are usually non-Hispanic race categories.

hiv_tract_race <- get_atlas(
  disease = "hiv",
  year = 2022,
  geography = "county",
  stratify_by = "race",
  extrapolate_to_tract = TRUE
)

Race/ethnicity-stratified extrapolation requires care because Census and AtlasPlus race/ethnicity categories may not always align perfectly.

The package attempts to use Census variables that correspond to the AtlasPlus combined race/ethnicity categories, but users should inspect the denominator mapping before interpreting results.

hiv_tract_race %>%
  count(race_ethnicity, indicator)
#> # A tibble: 14 × 3
#>    race_ethnicity                         indicator          n
#>    <chr>                                  <fct>          <int>
#>  1 American Indian/Alaska Native          HIV diagnoses  85396
#>  2 American Indian/Alaska Native          HIV prevalence 85396
#>  3 Asian                                  HIV diagnoses  85396
#>  4 Asian                                  HIV prevalence 85396
#>  5 Black/African American                 HIV diagnoses  85396
#>  6 Black/African American                 HIV prevalence 85396
#>  7 Hispanic/Latino                        HIV diagnoses  85396
#>  8 Hispanic/Latino                        HIV prevalence 85396
#>  9 Multiracial                            HIV diagnoses  85396
#> 10 Multiracial                            HIV prevalence 85396
#> 11 Native Hawaiian/Other Pacific Islander HIV diagnoses  85396
#> 12 Native Hawaiian/Other Pacific Islander HIV prevalence 85396
#> 13 White                                  HIV diagnoses  85396
#> 14 White                                  HIV prevalence 85396
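
One way to start inspecting the denominators is to count missing or zero tract denominators within each race/ethnicity category. The sketch below does this on a made-up table using the race_ethnicity and tract_population_acs column names from the output; it does not show the package's actual Census variable mapping.

```r
library(dplyr)

# Toy data; values are made up for illustration.
toy_race <- tibble(
  race_ethnicity = c("Asian", "Asian", "Multiracial", "Multiracial"),
  tract_population_acs = c(120, 0, NA, 45)
)

coverage <- toy_race %>%
  group_by(race_ethnicity) %>%
  summarize(
    n_rows    = n(),
    n_missing = sum(is.na(tract_population_acs)),
    n_zero    = sum(tract_population_acs == 0, na.rm = TRUE),
    .groups = "drop"
  )

coverage
```

Categories with many missing or zero denominators may indicate a mismatch between the AtlasPlus category and the Census variable used for it.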

Multiple stratification variables

Combining more than one stratification variable in a single call is not currently supported.

Working with geometry

Attaching tract geometry to the output is not currently supported.