Merge pull request #272 from PSLmodels/pr-prepare-targets-documentation-and-environment-management

Prepare targets documentation and environment management
donboyd5 authored Oct 31, 2024
2 parents 282dacc + b0280ad commit 11788ac
Showing 22 changed files with 3,597 additions and 571 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -13,4 +13,5 @@ tmd/storage/output/tax_expenditures
!tmd/areas/targets/*.csv
**demographics_2015.csv
**puf_2015.csv
*.DS_STORE
*.DS_STORE
.Rproj.user
1 change: 1 addition & 0 deletions tmd/areas/targets/prepare/.Rprofile
@@ -0,0 +1 @@
source("renv/activate.R")
2 changes: 2 additions & 0 deletions tmd/areas/targets/prepare/.gitignore
@@ -13,3 +13,5 @@ _freeze/

# Local Netlify folder
.netlify

/.quarto/
10 changes: 10 additions & 0 deletions tmd/areas/targets/prepare/R/constants.R
@@ -0,0 +1,10 @@

CDZIPURL <- "https://www.irs.gov/pub/irs-soi/congressional2021.zip"
CDDOCURL <- "https://www.irs.gov/pub/irs-soi/21incddocguide.docx"

CDDIR <- here::here("cds")
CDRAW <- fs::path(CDDIR, "raw_data")
CDINTERMEDIATE <- fs::path(CDDIR, "intermediate")
CDFINAL <- fs::path(CDDIR, "final")

CDDOCEXTRACT <- "cd_documentation_extracted_from_21incddocguide.docx.xlsx"
79 changes: 79 additions & 0 deletions tmd/areas/targets/prepare/R/libraries.R
@@ -0,0 +1,79 @@
# libraries ---------------------------------------------------------------

library(renv)

library(DT)
library(fs)
library(gt)
library(knitr)
library(readxl)
library(skimr)
library(stringr)
library(tidyverse)
# includes: dplyr, forcats, ggplot2, lubridate, purrr, stringr, tibble, tidyr

tprint <- 75 # default tibble print
options(tibble.print_max = tprint, tibble.print_min = tprint) # show up to tprint rows

# census_api_key("b27cb41e46ffe3488af186dd80c64dce66bd5e87", install = TRUE) # stored in .Renviron
# libraries needed for census population
library(sf)
library(tidycensus)
library(tigris)
options(tigris_use_cache = TRUE)


# possible libraries ------------------------------------------------------

# library(rlang)
# library(tidyverse)
# tprint <- 75 # default tibble print
# options(tibble.print_max = tprint, tibble.print_min = tprint) # show up to tprint rows
#
# library(fs)

# tools
# library(vroom)
# library(readxl)
# library(openxlsx) # for writing xlsx files
# library(lubridate)
# library(RColorBrewer)
# library(RcppRoll)
# library(fredr)
# library(tidycensus)
# library(googledrive)
# library(arrow)
#
# library(jsonlite)
# library(tidyjson)
#
#
# # boyd libraries
# # library(btools)
# # library(bdata)
# # library(bggtools)
# # library(bmaps)
#
# # graphics
# library(scales)
# library(ggbeeswarm)
# library(patchwork)
# library(gridExtra)
# library(ggrepel)
# library(ggbreak)
#
# # tables
# library(knitr)
# library(kableExtra)
# library(DT)
# library(gt)
# library(gtExtras)
# library(janitor)
# library(skimr)
# library(vtable)
#
# # maps
# library(maps)
# # https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
# library(usmap)

15 changes: 8 additions & 7 deletions tmd/areas/targets/prepare/_quarto.yml
@@ -29,26 +29,27 @@ execute:

book:
title: "Develop targets for subnational areas"
subtitle: "Create csv target files for use by area targeting routies"
subtitle: "Create csv target files for use by area targeting routines"
# author: "Don Boyd"
date: today
date-format: long
chapters:
- index.qmd
- part: "Usage and notes"
- part: "Usage"
chapters:
- usage.qmd
- cd_issues_and_TODOs.qmd
- part: "IRS Congressional District data"
- part: "IRS SOI Congressional District data"
chapters:
- cd_overall_documentation.qmd
- cd_get_census_population.qmd
- cd_download_and_clean_census_population_data.qmd
- cd_download_soi_data.qmd
- cd_construct_variable_documentation.qmd
- cd_construct_soi_variable_documentation.qmd
- cd_construct_long_soi_data_file.qmd
- cd_create_basefile_for_cd_target_files.qmd
- cd_create_crosswalk_from_cd117th_to_cd118th.qmd
- cd_map_tcvars_and_extract_target_files.qmd
appendices:
- cd_issues_and_TODOs.qmd
- cd_IRS_documentation.qmd

format:
html:
Expand Down
Original file line number Diff line number Diff line change
@@ -4,23 +4,15 @@ editor_options:
chunk_output_type: console
---

# About the data
# About the SOI Congressional District data

This chapter has two sections:
All text in this section is copied verbatim from the IRS SOI data documentation (21incddocguide.docx), with no substantive edits and no commentary.

- IRS documentation: Copied verbatim from the IRS SOI data documentation (21incddocguide.docx), with no substantive edits and no commentary.

- Comments on the data: Notes about selected issues, quirks, and pitfalls in the data discovered from working with the data.

## IRS documentation

All text in this section is a direct quote from IRS documentation.

### Time period
## Time period

The Statistics of Income (SOI) Division’s Congressional district data is tabulated using individual income tax returns (Forms 1040) filed with the Internal Revenue Service (IRS) during the 12-month period, January 1, 2022 to December 31, 2022. While the bulk of returns filed during this 12-month period are primarily for Tax Year 2021, the IRS received a limited number of returns for tax years before 2021. These prior-year returns are used as a proxy for returns that are typically filed beyond the 12-month period and have been included within the congressional district data.

### Population Definitions and Tax Return Addresses
## Population Definitions and Tax Return Addresses

- Congressional data are based on the population of individual income tax returns processed by the IRS during the 2022 calendar year.

@@ -38,15 +30,15 @@ The Statistics of Income (SOI) Division’s Congressional district data is tabul

- Tax returns filed using Army Post Office (APO) and Fleet Post Office addresses, foreign addresses, and addresses in Puerto Rico, Guam, Virgin Islands, American Samoa, Marshall Islands, Northern Marianas, and Palau were excluded.

### Congressional District and ZIP Code Matching Procedures
## Congressional District and ZIP Code Matching Procedures

SOI uses a commercial file to match ZIP codes to congressional districts. Congressional districts cover the 435 congressional districts in the 50 states and the District of Columbia. District boundaries are based on the 117th Congress.

The matching process first utilizes the 9-digit ZIP code, if present on the return, to determine the proper congressional district for that return. Nearly 97 percent of the returns match on the 9-digit ZIP code. When the 9-digit ZIP code is not available, the matching process uses the 5-digit ZIP code to determine the proper congressional district. Returns that do not match on ZIP code, or where a ZIP code is not present, are excluded from the data.

Eight states (AK, DC, DE, MT, ND, SD, VT, and WY) have only one congressional district, therefore the matching procedures are not performed on these states. Returns with only one congressional district represent 2 percent of the total number of returns.

### Disclosure Protection Procedures
## Disclosure Protection Procedures

SOI did not attempt to correct any ZIP codes listed on the tax returns; however, it did take the following precautions to avoid disclosing information about specific taxpayers:

@@ -56,7 +48,7 @@ SOI did not attempt to correct any ZIP codes listed on the tax returns; however,

- If an income or tax item from one return constitutes more than a specified percentage of the total of any particular cell, the specific data item for that return is excluded from that cell. For example, if the amount for wages from one return represents 75 percent of the value of the total for that cell, the data item will be suppressed. The actual threshold percentage used cannot be released.

### IRS Endnotes
## IRS Endnotes

[1] The use of prior-year returns as a proxy for returns that are filed beyond the current processing year is consistent with SOI’s national, state, county, and ZIP code tabulations. A description of SOI’s sample, which is used as an input for the geographic data, and the use of prior-year returns, can be found at https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-returns-publication-1304-complete-report#_sec2.

@@ -106,34 +98,3 @@ An advance refund of the 2021 recovery rebate credit made under section 6428B of

[21] The amount of overpayments the tax filer requested to have refunded.


## Comments on the data

### Determining which records are Congressional District records

- Calculate nstub0 -- number of records by state where AGI_STUB == 0 (the totals record)
- Note that CONG_DISTRICT == "00" is a totals record for the state. There are 8 states that only have 1 CD (see IRS documentation above), and for those states this record doubles as a CD record and as the state record.
- Determine type of record:
    - US -- STATE == "US"
    - DC -- STATE == "DC"
    - state -- nstub0 \> 1 & CONG_DISTRICT == "00"
    - cdstate -- nstub0 == 1 (this is both a state record and a CD record, for 8 states)
    - cd -- nstub0 \> 1 & CONG_DISTRICT != "00"

The cd and cdstate records have data for Congressional Districts. There are 435 of these for AGI_STUB == 0, one for each voting Congressional District (not including the District of Columbia). The SOI data also have records for the nonvoting DC district, which is not included in the 435 Congressional Districts.

The state and cdstate records have data for states. There are 51 of these (4)

![](images/clipboard-719051713.png)
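
The record-type rules listed above can be sketched in R roughly as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning code: the toy data frame `soi_toy` and the result name `rectypes` are assumptions introduced here, while the column names (`STATE`, `CONG_DISTRICT`, `AGI_STUB`) and the rectype labels come from the notes above.

```r
library(dplyr)

# Hypothetical rows for illustration only (not real SOI data)
soi_toy <- tibble::tribble(
  ~STATE, ~CONG_DISTRICT, ~AGI_STUB,
  "US",   "00",           0,
  "DC",   "00",           0,
  "WY",   "00",           0,
  "WY",   "00",           1,
  "NY",   "00",           0,
  "NY",   "01",           0,
  "NY",   "02",           0,
  "NY",   "01",           1
)

rectypes <- soi_toy |>
  group_by(STATE) |>
  # nstub0: number of totals records (AGI_STUB == 0) within each state
  mutate(nstub0 = sum(AGI_STUB == 0)) |>
  ungroup() |>
  # case_when() uses the first matching rule, so US and DC must come first:
  # they also have nstub0 == 1 and would otherwise be labeled "cdstate"
  mutate(rectype = case_when(
    STATE == "US" ~ "US",
    STATE == "DC" ~ "DC",
    nstub0 > 1 & CONG_DISTRICT == "00" ~ "state",
    nstub0 == 1 ~ "cdstate",
    nstub0 > 1 & CONG_DISTRICT != "00" ~ "cd",
    TRUE ~ NA_character_
  ))

count(rectypes, rectype)
```

Because `nstub0` is computed per state and attached to every row, records for AGI_STUB values other than 0 inherit the same record type as their district's totals record.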

To verify that this produces a proper calculation of the number of districts by state, I asked ChatGPT (4o) the following question, and compared the results by state to the calculation above. They are the same.

> Please give me a table of the number of congressional districts by state (plus the District of Columbia), based on the 117th Congress, ideally as a google sheet or exportable to a spreadsheet. It should have 3 columns: state postal abbreviation, state name, and number of districts. It should add to 435 districts, I believe.

### Exemptions

Note that there are no data on exemptions but we do have total number of individuals (N2).

When run on 2024-10-12, the tmd national population was 334,283,385 (`national_population = (vardf.s006 * vardf.XTOT).sum()`). By contrast, the sum of N2 for the U.S. was 289,054,220, or 13.5% less, according to 21incdall.xlsx.

FWIW, the IRS total number of returns was 160,824,340 per 21in14ar.xls. When run on 2024-10-12, the tmd sum of s006 was 184,024,657, or 14.4% more. By contrast, the sum of N1 for the U.S. was 157,375,370, or 2.1% less, according to 21incdall.xlsx.
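
The percentage differences quoted above can be reproduced with a few lines of R. This is only a check of the arithmetic, using the totals exactly as stated in the notes; the helper `pct_diff()` and the variable names are hypothetical, introduced here for illustration.

```r
# Totals as quoted in the notes above
tmd_population <- 334283385  # tmd weighted population, run on 2024-10-12
soi_n2         <- 289054220  # sum of N2 for the U.S. (21incdall.xlsx)
irs_returns    <- 160824340  # IRS total number of returns (21in14ar.xls)
tmd_returns    <- 184024657  # tmd sum of s006, run on 2024-10-12
soi_n1         <- 157375370  # sum of N1 for the U.S. (21incdall.xlsx)

# Hypothetical helper: percent difference of x relative to a base value
pct_diff <- function(x, base) 100 * (x / base - 1)

n2_vs_tmd  <- round(pct_diff(soi_n2, tmd_population), 1)  # -13.5
tmd_vs_irs <- round(pct_diff(tmd_returns, irs_returns), 1) # 14.4
n1_vs_irs  <- round(pct_diff(soi_n1, irs_returns), 1)      # -2.1

c(n2_vs_tmd = n2_vs_tmd, tmd_vs_irs = tmd_vs_irs, n1_vs_irs = n1_vs_irs)
```

Note the bases: N2 is compared with the tmd population, while both the tmd sum of s006 and the SOI sum of N1 are compared with the IRS return total.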
38 changes: 31 additions & 7 deletions tmd/areas/targets/prepare/cd_construct_long_soi_data_file.qmd
@@ -6,7 +6,9 @@ editor_options:

# Parse the Congressional District data

Clean the CD data and save.
Goal: create a 117th CD data file that is almost in the form needed for targets files.

This involves cleaning the SOI Congressional District data, adding AGI bin information, adding variable documentation, and saving the result as a long file.
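
As a rough illustration of what "long file" means here, the sketch below reshapes a tiny wide table to one row per district, AGI bin, and variable, then attaches a description for each variable. The data values, the descriptions, and the names `cdwide_toy`, `doc_toy`, and `cdlong_toy` are made up for this example; only the general pattern (pivot to long, then join documentation by variable name) reflects the steps described above.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per district and AGI bin, one column per SOI variable.
# All values are invented for illustration.
cdwide_toy <- tibble::tribble(
  ~STATE, ~CONG_DISTRICT, ~AGI_STUB, ~N1,  ~A00100,
  "NY",   "01",           0,         1000, 50000,
  "NY",   "02",           0,         2000, 90000
)

# Toy documentation table; descriptions are placeholders, not IRS wording
doc_toy <- tibble::tribble(
  ~vname,   ~description,
  "N1",     "number of returns (placeholder description)",
  "A00100", "adjusted gross income amount (placeholder description)"
)

idvars <- c("STATE", "CONG_DISTRICT", "AGI_STUB")

cdlong_toy <- cdwide_toy |>
  # every non-id column becomes a (vname, value) pair
  pivot_longer(cols = -all_of(idvars), names_to = "vname") |>
  # attach the documentation for each variable
  left_join(doc_toy, by = join_by(vname))

cdlong_toy
```

A long layout like this keeps one value per row, which makes it straightforward to filter by variable name or AGI bin when extracting targets.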

## Setup

@@ -28,10 +30,13 @@ Here is an example of the first few rows of a targets file:

![](images/Image 2024-10-20 at 5.23.32 PM.jpeg)

## Create AGI stub information
## Create and save AGI bin labels, and show bins

Create and map AGI bin labels to the AGI bins that SOI uses, and save in ".../cds/intermediate".

```{r}
#| label: agi-bins
#| output: false
# example of targets file
# varname,count,scope,agilo,agihi,fstatus,target
@@ -83,6 +88,13 @@ file="AGI_STUB; agirange; agilo; agihi
write_csv(agibins, fs::path(CDINTERMEDIATE, "cd_agi_bins.csv"))
```

Show AGI bins.

```{r}
#| label: show-agi-bins
# agibins |> kable()
agibins |>
gt() |>
@@ -96,13 +108,15 @@ agibins |>
```

## Prepare, clean, and save wide data file

Set eval: to true for these chunks to recreate the data file.
## Prepare, clean, and save SOI Congressional District wide data file

Get previously downloaded IRS SOI data with aggregate information for individual Congressional Districts.

```{r}
#| label: parse-cddata
#| eval: true
#| output: false
# read the csv file from the zip archive that contains it
zpath <- fs::path(CDRAW, fs::path_file(CDZIPURL))
@@ -115,9 +129,16 @@ count(data, CONG_DISTRICT) # max is 53
```

Clean SOI CD data:

- create record-type variable
- add agi bin labels and bounds
-

```{r}
#| label: clean-save-cddata-wide
#| eval: true
#| output: false
# cleaning and reshaping:
# - determine record type
@@ -166,9 +187,13 @@ rm(data, data2, cdnums)

## Create long SOI data file

- convert to a long file
- merge with variable documentation file
- save as "cddata_long_clean.csv" in intermediate file directory

```{r}
#| label: create-save-soi-cddata-long
#| eval: false
#| eval: true
cdwide <- read_csv(fs::path(CDINTERMEDIATE, "cddata_wide_clean.csv"))
doc <- read_csv(fs::path(CDINTERMEDIATE, "variable_documentation.csv"))
@@ -179,15 +204,14 @@ glimpse(doc)
idvars <- c("rectype", "ndist", "STATEFIPS", "STATE", "CONG_DISTRICT",
"AGI_STUB", "agirange", "agilo", "agihi")
# TODO: put amount units in dollars!!
dlong1 <- cdwide |>
pivot_longer(cols = -all_of(idvars),
names_to = "vname") |>
left_join(doc |>
select(vname, description, reference, vtype, basevname),
by = join_by(vname))
glimpse(dlong1)
count(dlong1, vname)
count(dlong1, vtype)
Expand Down