Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial PE-based flat file #4

Merged
merged 17 commits into from
Feb 8, 2024
Merged

Conversation

nikhilwoodruff
Copy link
Collaborator

@nikhilwoodruff nikhilwoodruff commented Feb 8, 2024

This PR has my working progress for the PE-based flat file usable in Tax-Calculator. I'll try and keep closely in sync with local progress, and intend for it to be merged once we have Tax-Calculator's Python package successfully running off it within the repo.

@nikhilwoodruff nikhilwoodruff self-assigned this Feb 8, 2024
@nikhilwoodruff nikhilwoodruff marked this pull request as draft February 8, 2024 12:53
@nikhilwoodruff
Copy link
Collaborator Author

Creating a PolicyEngine reform to add tc_VARIABLE_NAME alias variables for inputs, modifying PolicyEngine Core slightly - PolicyEngine/policyengine-core#155 - to make this more compact.

@nikhilwoodruff
Copy link
Collaborator Author

@martinholmer - I'm just adding the very basic flat-file variables (got the required MARS and RECID, just adding earnings) and Tax-Calculator seems to be saying something strange. Screenshot below- am I missing something?

image

@nikhilwoodruff
Copy link
Collaborator Author

Worth noting the PolicyEngine US error I don't get locally, will refine the test environment until it's fixed

@martinholmer
Copy link
Collaborator

martinholmer commented Feb 8, 2024

@nikhilwoodruff, It's hard for me to say what's the problem without looking at the input data file you're using. Can you zip it up and send it to me as an email attachment?

@nikhilwoodruff
Copy link
Collaborator Author

Thanks @martinholmer - just sent.

@nikhilwoodruff
Copy link
Collaborator Author

@martinholmer seems to be a floating-point precision issue (Tax-Calculator has a zero-tolerance check I see). Probably not that big actually, I'll just add some manual adjustments in the code.

@nikhilwoodruff nikhilwoodruff marked this pull request as ready for review February 8, 2024 16:51
@nikhilwoodruff
Copy link
Collaborator Author

OK, this PR now has the basic repo structure, plus one test which checks that we can generate a CSV file of PolicyEngine's Enhanced CPS dataset with some very basic variables (just marital status, tax unit ID and earnings for filer and spouse), and that Tax-Calculator can run the full routine off it. I think we should merge it in now to keep PRs small and get the basic repo structure in the main branch.

@martinholmer
Copy link
Collaborator

@nikhilwoodruff, Thanks for sharing the preliminary flat file. You're absolutely correct to begin with a simple file. Here is what I did with the file, then below that I list a few things that could be improved.

% head -2 tc_input_test.csv
,RECID,MARS,e00200p,e00200s,e00200
0,101.0,1.0,0.0,0.0,0.0

% awk -F, 'NR>1&&($4+$5)!=$6{n++}END{print n}' tc_input_test.csv
6512

% awk -F, 'NR>1&&($4+$5)!=$6' tc_input_test.csv | head
77702,100009500.0,2.0,856409.75,1356590.2,2213000.0
77706,100010200.0,2.0,739231.4,568.6395,739800.0
77709,100010600.0,2.0,1654237.2,297762.72,1952000.0
77710,100011200.0,2.0,325443.75,379306.28,704750.0
77712,100011400.0,2.0,134328.02,306691.0,441019.0
77742,100015800.0,2.0,2054616.9,4167383.2,6222000.0
77744,100016400.0,2.0,26127.652,174184.34,200312.0
77749,100016904.0,2.0,74434.78,1637565.2,1712000.0
77761,100019900.0,2.0,1294146.4,221853.66,1516000.0
77762,100020200.0,2.0,174275.38,42479.625,216755.0

Here is my list of suggestions:

  • omit the Pandas index

  • make the RECID and MARS variables be integer

  • correct for PEUS single-precision problems with large numbers by rounding the e00200p and e00200s variables to the nearest cent and then adding them together to get the value of e00200. The error you got from Tax-Calculator was saying that e00200p and e00200s did not add up to e00200.

@nikhilwoodruff
Copy link
Collaborator Author

nikhilwoodruff commented Feb 8, 2024

Thanks @martinholmer! Will add those suggestions.

Edit: actually, I think we should just post-hoc correct e00200 to equal e00200p + e00200s under Tax-Calculator's data type to minimise the changes to the record data.

@martinholmer
Copy link
Collaborator

@nikhilwoodruff, The precision problems arise because PE-Core uses 32-bit floats while Tax-Calculator using 64-bit floats.

@nikhilwoodruff
Copy link
Collaborator Author

Thanks for your help @martinholmer. This PR passes the test and I've added your suggestions, so merging now.

@nikhilwoodruff nikhilwoodruff merged commit fc30430 into master Feb 8, 2024
2 checks passed
@nikhilwoodruff nikhilwoodruff deleted the nikhilwoodruff/issue2 branch February 8, 2024 17:10
@nikhilwoodruff nikhilwoodruff linked an issue Feb 9, 2024 that may be closed by this pull request
donboyd5 pushed a commit to donboyd5/tax-microdata-benchmarking that referenced this pull request Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add basic variables to flat file
2 participants