Bills data source and script / data format #2

Open
ninjaguru-git opened this issue Feb 19, 2022 · 2 comments
Labels: enhancement, help wanted, question

Contributor

ninjaguru-git commented Feb 19, 2022

  1. Data source and script

According to the unitedstates/congress wiki, there are currently four data sources:

  • Contemporary Bills (2013-Present)
  • Modern Historical Bills (1973-2012)
  • Statutes via the Statutes at Large (1951-1972)
  • Early Historical Bills via the American Memory Project (1799-1873)

The congress wiki suggests different scraping scripts depending on the data source. We need to settle the following questions to clarify the billtitleindex solution.

Problem 1: The README currently describes only the Contemporary Bills, so we need to confirm which data source billtitleindex should use.
Problem 2: This concerns the bill scraping scripts. Should we also consider the scraping scripts for the other data sources? And if we use the unitedstates/congress Python package, should billtitleindex call it directly, or should we develop a customized solution without it for scraping and periodically updating the data?

  2. Data format

Currently, running usc-run govinfo --bulkdata=BILLSTATUS produces only fdsys_billstatus.xml and fdsys_billstatus-lastmod.txt for each bill, covering the Contemporary Bills and part of the Modern Historical Bills. Running usc-run bills afterwards produces data.xml, data.json and data-fromfdsys-lastmod.txt, so some data conversion takes place in that step.
fdsys_billstatus.xml is the XML file dedicated to our purpose, but it does not contain the keys that README.md leads us to expect, such as "as" and "is_for_portion". The processed data.json file does have these keys.
Modern Historical Bills have only data.json files, with no XML files.
So we need to confirm the following questions about the data format that billtitleindex should process.

Problem 3: Should we index the processed data.json files, or parse the XML files with lxml?
Problem 4: Because this link has no information, usc-run govinfo --bulkdata=BILLSTATUS fails when it reaches the 112th Congress bills instead of skipping them. As a result, the command does not update the bills of some Congresses. We need to confirm the status of the 112th Congress bills data.
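
To make the XML side of Problem 3 concrete, here is a minimal sketch of pulling titles out of a BILLSTATUS-style file. The element names (billStatus, bill, titles, item, titleType, title) and the sample text are assumptions about the layout, not copied from a real file, and the function name is hypothetical. It uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; lxml.etree offers the same findtext/iterfind calls.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical sample shaped like a BILLSTATUS file (layout is an assumption).
SAMPLE = """<billStatus>
  <bill>
    <titles>
      <item>
        <titleType>Short Titles as Introduced</titleType>
        <title>Example Act of 2022</title>
      </item>
      <item>
        <titleType>Official Title as Introduced</titleType>
        <title>To provide an example, and for other purposes.</title>
      </item>
    </titles>
  </bill>
</billStatus>"""


def titles_from_billstatus(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per <item> under bill/titles (assumed layout)."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iterfind("./bill/titles/item"):
        records.append(
            {
                "title": item.findtext("title"),
                # There is no separate "as" or "is_for_portion" field here;
                # that information would have to be derived from titleType.
                "titleType": item.findtext("titleType"),
            }
        )
    return records
```

In this sketch the "as" and "is_for_portion" values would have to be reconstructed from the titleType string, which is work that the processed data.json already carries out.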

ninjaguru-git added the enhancement, help wanted and question labels Feb 19, 2022
Collaborator

aih commented Feb 19, 2022

We mostly need recent bills, after the 115th Congress. It is probably easiest to run the usc-run bills command and parse the resulting data.json. This produces more metadata than we need, but it is simple and does include the title data.
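
A rough sketch of that approach, under some assumptions: the directory layout (congress number / bills / bill type / bill id / data.json), the "titles" and "bill_id" keys, and the function name iter_bill_titles are all illustrative and should be checked against real output. The "as" and "is_for_portion" keys are the ones mentioned above. The 115 cutoff is a parameter because "after the 115th" could be read either way.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_bill_titles(data_dir: Path, min_congress: int = 115) -> Iterator[Dict]:
    """Yield one record per title found in data.json files under data_dir.

    Assumes a layout like <data_dir>/<congress>/bills/<type>/<bill>/data.json.
    """
    for path in sorted(data_dir.glob("*/bills/*/*/data.json")):
        congress = path.relative_to(data_dir).parts[0]
        # Skip anything that is not a congress directory or is too old.
        if not congress.isdigit() or int(congress) < min_congress:
            continue
        bill = json.loads(path.read_text(encoding="utf-8"))
        for title in bill.get("titles", []):
            yield {
                # Fall back to the directory name if bill_id is absent.
                "bill_id": bill.get("bill_id", path.parent.name),
                "title": title.get("title"),
                "type": title.get("type"),
                "as": title.get("as"),
                "is_for_portion": title.get("is_for_portion", False),
            }
```

Usage would be along the lines of `for record in iter_bill_titles(Path("data")): ...`, with each record handed to whatever indexer billtitleindex ends up using.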

Collaborator

aih commented Feb 19, 2022

We should include the unitedstates/congress scraper as a pip requirement, now that it can be installed as a package.
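
With the scraper installed as a requirement, a periodic update could drive the two commands quoted earlier in this thread from Python. This is only a sketch: run_steps and STEPS are hypothetical names, it assumes usc-run is on the PATH, and stopping at the first failing step is a design choice made here, not something the scraper prescribes.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# The two commands mentioned in this issue, in the order they are run.
STEPS: List[List[str]] = [
    ["usc-run", "govinfo", "--bulkdata=BILLSTATUS"],
    ["usc-run", "bills"],
]


def run_steps(
    steps: Sequence[Sequence[str]] = STEPS,
    runner: Callable = subprocess.run,
) -> List[Tuple[str, int]]:
    """Run each command in order and return (command, return code) pairs.

    Stops after the first non-zero return code so that a failed download
    is visible instead of being silently followed by the conversion step.
    """
    results = []
    for cmd in steps:
        completed = runner(list(cmd), check=False)
        results.append((" ".join(cmd), completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

The runner argument exists so that the function can be exercised without the scraper installed, for example by passing a stub in place of subprocess.run.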
