I created a puf.csv using the master branch of taxdata (git cloned on 7/2/2021) by running make cps-files and then make puf-files. I think I did it correctly.
I just finished creating state weights for this file and noticed many groups of records where the state weights are identical - for example, RECIDs 1-5 all have the same weight for every state. That means the 22 variables I used in the targeting process were identical across those 5 records.
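As a quick check, something like the following confirms that (a sketch only; TARGET_VARS is a hypothetical placeholder for the 22 variables actually used in targeting):
import pandas as pd
puf = pd.read_csv('puf.csv')  # the 7/2/2021 taxdata-created file
TARGET_VARS = ['e00200', 'e00300']  # placeholder; substitute the 22 targeting variables
first5 = puf.loc[puf.RECID.isin(range(1, 6)), TARGET_VARS]
first5.drop_duplicates().shape[0]  # 1 means all 5 records are identical on these variables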
That led me to look for duplicates in my 7/2/2021 puf.csv (code copied below). I did it two ways: (1) looking for duplicates using all variables other than RECID, and (2) also dropping FLPDYR, on the theory that even if FLPDYR varies, a record that matches on every other variable is really the same person (there may be additional variables we should treat the same way).
This analysis showed that there are 2,673 records that are EXACT duplicates on all variables other than RECID (99 variables). If we also exclude FLPDYR, there are 2,828 records that are exact duplicates on the 98 variables compared.
This was a surprise to me and seems like it might not be intended.
In addition, I looked at the very first record in my 7/2/2021 puf.csv (RECID==1), which is duplicated 5 times under my second method, and it appears to be the same as the first record in the last official puf.csv I have ready access to (8/20/2020). (I can't easily compare other records because beyond that point the RECIDs are no longer comparable.)
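For reference, the comparison was along these lines (a sketch; 'puf_2020-08-20.csv' is just a hypothetical name for my copy of the 8/20/2020 official file):
import pandas as pd
puf_new = pd.read_csv('puf.csv')             # 7/2/2021 taxdata-created file
puf_old = pd.read_csv('puf_2020-08-20.csv')  # hypothetical filename for the 8/20/2020 official file
# compare the first record on the columns the two files share, ignoring RECID
common = [c for c in puf_new.columns if c in puf_old.columns and c != 'RECID']
row_new = puf_new.loc[puf_new.RECID == 1, common].iloc[0]
row_old = puf_old.loc[puf_old.RECID == 1, common].iloc[0]
(row_new == row_old).all()  # True if the records match on all shared columns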
This record has s006=147096 in the 8/20/2020 puf.csv, and ALL 5 corresponding duplicate records in my 7/2/2021 puf.csv also have s006=147096. If the duplicates occurred because they are being matched against CPS records, or something like that, I would have thought the weight would be split so that each of the 5 copies would have a smaller weight that, when summed across the 5 duplicates, equals 147096. Because that is not the case, my guess is that the weights on the duplicate records are too large.
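In other words, if splitting were intended, I would have expected something like the following to happen somewhere in the pipeline (a sketch of my expectation, not actual taxdata code):
import pandas as pd
puf = pd.read_csv('puf.csv')  # the 7/2/2021 taxdata-created file
cols = [c for c in puf.columns if c not in ('RECID', 'FLPDYR')]
# size of each exact-duplicate group (1 for records with no duplicates)
group_size = puf.groupby(cols, dropna=False)['RECID'].transform('size')
puf['s006_split'] = puf['s006'] / group_size  # the 5 copies of RECID 1 would then sum to 147096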
Is it possible that this is unintended?
Thanks.
import pandas as pd
DIR_FOR_BOYD_PUFCSV = r'/media/don/data/puf_files/puf_csv_related_files/Boyd/2021-07-02/'
PUFPATH = DIR_FOR_BOYD_PUFCSV + 'puf.csv'
puf = pd.read_csv(PUFPATH)
# check for duplicates using ALL columns other than RECID
cols = puf.columns.tolist()
cols.remove('RECID')
dups = puf[puf.duplicated(subset=cols, keep=False)]  # keep=False flags every record in a duplicate group, not just the later copies
dups.shape # 2673 duplicates
# also drop FLPDYR
cols2 = cols.copy()
cols2.remove('FLPDYR')
dups2 = puf[puf.duplicated(subset=cols2, keep=False)]
dups2.shape # 2828 duplicates
# which h_seq has the most duplicates?
dups2.h_seq.value_counts()
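A related check (a sketch, continuing from the code above) is to see how much weight the extra copies carry beyond keeping one record per duplicate group:
# distribution of duplicate-group sizes (number of exact copies in each group)
dups2.groupby(cols2, dropna=False).size().value_counts()
# weight carried by the extra copies, beyond keeping one record per group
dups2.s006.sum() - dups2.drop_duplicates(subset=cols2).s006.sum()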
Probably of interest to @andersonfrailey @MattHJensen.