I created a puf.csv using the master branch of taxdata (git cloned on 7/2/2021) by running make cps-files and then make puf-files. I think I did it correctly.
I just finished creating state weights for this file and noticed many groups of records where the state weights are identical - for example, RECIDs 1-5 all have the same weight for every state. That means the 22 variables I used in the targeting process were identical across those 5 records.
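As a quick check, something like the following confirms that (a sketch only; TARGET_VARS is a hypothetical placeholder for the 22 variables actually used in targeting):
import pandas as pd
puf = pd.read_csv('puf.csv')  # the 7/2/2021 taxdata-created file
TARGET_VARS = ['e00200', 'e00300']  # placeholder; substitute the 22 targeting variables
first5 = puf.loc[puf.RECID.isin(range(1, 6)), TARGET_VARS]
first5.drop_duplicates().shape[0]  # 1 means all 5 records are identical on these variables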
That led me to look for duplicates in my 7/2/2021 puf.csv (code copied below). I did it two ways: (1) looking for duplicates using all variables other than RECID, and (2) also dropping FLPDYR, on the theory that even if FLPDYR varies, a record that matches on every other variable is really the same person (there may be additional variables we should treat the same way).
This analysis showed that there are 2,673 records that are EXACT duplicates on all variables other than RECID (99 variables). If we also exclude FLPDYR, there are 2,828 records that are exact duplicates on the 98 variables compared.
This was a surprise to me and seems like it might not be intended.
In addition, I looked at the very first record in my 7/2/2021 puf.csv (RECID==1), which is duplicated 5 times under my second method, and it appears to be the same as the first record in the last official puf.csv I have ready access to (8/20/2020). (I can't easily compare other records because beyond that point the RECIDs are no longer comparable.)
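For reference, the comparison was along these lines (a sketch; 'puf_2020-08-20.csv' is just a hypothetical name for my copy of the 8/20/2020 official file):
import pandas as pd
puf_new = pd.read_csv('puf.csv')             # 7/2/2021 taxdata-created file
puf_old = pd.read_csv('puf_2020-08-20.csv')  # hypothetical filename for the 8/20/2020 official file
# compare the first record on the columns the two files share, ignoring RECID
common = [c for c in puf_new.columns if c in puf_old.columns and c != 'RECID']
row_new = puf_new.loc[puf_new.RECID == 1, common].iloc[0]
row_old = puf_old.loc[puf_old.RECID == 1, common].iloc[0]
(row_new == row_old).all()  # True if the records match on all shared columns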
This record has s006=147096 in the 8/20/2020 puf.csv, and ALL 5 corresponding duplicate records in my 7/2/2021 puf.csv also have s006=147096. If the duplicates occurred because they are being matched against CPS records, or something like that, I would have thought the weight would be split so that each of the 5 copies would have a smaller weight that, when summed across the 5 duplicates, equals 147096. Because that is not the case, my guess is that the weights on the duplicate records are too large.
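In other words, if splitting were intended, I would have expected something like the following to happen somewhere in the pipeline (a sketch of my expectation, not actual taxdata code):
import pandas as pd
puf = pd.read_csv('puf.csv')  # the 7/2/2021 taxdata-created file
cols = [c for c in puf.columns if c not in ('RECID', 'FLPDYR')]
# size of each exact-duplicate group (1 for records with no duplicates)
group_size = puf.groupby(cols, dropna=False)['RECID'].transform('size')
puf['s006_split'] = puf['s006'] / group_size  # the 5 copies of RECID 1 would then sum to 147096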
Is it possible that this is unintended?
Thanks.
import pandas as pd
DIR_FOR_BOYD_PUFCSV = r'/media/don/data/puf_files/puf_csv_related_files/Boyd/2021-07-02/'
PUFPATH = DIR_FOR_BOYD_PUFCSV + 'puf.csv'
puf = pd.read_csv(PUFPATH)
# check for duplicates using ALL columns other than RECID
cols = puf.columns.tolist()
cols.remove('RECID')
dups = puf[puf.duplicated(subset=cols, keep=False)]  # keep=False flags every record in a duplicate group, not just the later copies
dups.shape # 2673 duplicates
# also drop FLPDYR
cols2 = cols.copy()
cols2.remove('FLPDYR')
dups2 = puf[puf.duplicated(subset=cols2, keep=False)]
dups2.shape # 2828 duplicates
# which h_seq has the most duplicates?
dups2.h_seq.value_counts()
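A related check (a sketch, continuing from the code above) is to see how much weight the extra copies carry beyond keeping one record per duplicate group:
# distribution of duplicate-group sizes (number of exact copies in each group)
dups2.groupby(cols2, dropna=False).size().value_counts()
# weight carried by the extra copies, beyond keeping one record per group
dups2.s006.sum() - dups2.drop_duplicates(subset=cols2).s006.sum()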
Probably of interest to @andersonfrailey @MattHJensen.