Bills data source and script / data format #2

ninjaguru-git · 2022-02-19T15:59:15Z

Data source and script

According to the unitedstates/congress wiki, there are currently four data sources:

Contemporary Bills (2013-Present)

Modern Historical Bills (1973-2012)

Statutes via the Statutes at Large (1951-1972)

Early Historical Bills via the American Memory Project (1799-1873)

The congress wiki suggests a different scraping script for each data source. We need to resolve the problems below to settle on an approach for billtitleindex.

Problem 1: The README currently describes only the Contemporary Bills, so we need to confirm which data source(s) billtitleindex should use.
Problem 2: This concerns the bill scraping scripts. Should we also consider the scraping scripts for the other data sources? If we use the unitedstates/congress Python package, should billtitleindex call it directly, or should we develop a customized solution without it for scraping and periodically updating the data?

Data format

Currently, running usc-run govinfo --bulkdata=BILLSTATUS produces only fdsys_billstatus.xml and fdsys_billstatus-lastmod.txt for each bill, covering the Contemporary Bills and part of the Modern Historical Bills. Running usc-run bills afterwards produces data.xml, data.json, and data-fromfdsys-lastmod.txt, so that step performs a data conversion.
fdsys_billstatus.xml is the XML file intended for our purpose, but it does not contain the keys that README.md leads us to expect, such as "as" and "is_for_portion". The processed data.json file does have these keys.
The Modern Historical Bills have only data.json files, with no XML files.
So we need to resolve the following data-format problems for billtitleindex.

Problem 3: Should we index the processed data.json files, or parse the XML files with lxml?
Problem 4: Because this link returns no information, usc-run govinfo --bulkdata=BILLSTATUS fails when it reaches the 112th Congress bills instead of skipping them. As a result, the command does not update the bills for some congresses. We need to confirm the status of the 112th Congress bill data.
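For the periodic update asked about in Problem 2, one possible shape is a small wrapper that runs the two commands quoted above in order and records each exit code, so a failure in the download step (as in Problem 4) is visible but does not stop the conversion of bills that were already downloaded. This is only a sketch: the function name run_steps, the injectable runner parameter, and the choice to continue after a failure are assumptions for illustration, not part of unitedstates/congress.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# The two commands from this issue, in the order they have to run.
STEPS: List[List[str]] = [
    ["usc-run", "govinfo", "--bulkdata=BILLSTATUS"],  # download the bulk XML
    ["usc-run", "bills"],                             # convert to data.json / data.xml
]


def run_steps(
    steps: Sequence[Sequence[str]] = STEPS,
    runner: Callable = subprocess.run,  # hypothetical hook so the wrapper can be tested
) -> List[Tuple[str, int]]:
    """Run each command and return (command line, exit code) pairs.

    A non-zero exit code is recorded rather than raised, so later steps still run.
    """
    results = []
    for cmd in steps:
        try:
            completed = runner(list(cmd), check=False)
            code = completed.returncode
        except FileNotFoundError:
            # usc-run is not on PATH; 127 mirrors the usual shell convention.
            code = 127
        results.append((" ".join(cmd), code))
    return results


if __name__ == "__main__":
    for command, code in run_steps():
        status = "ok" if code == 0 else f"failed ({code})"
        print(f"{command}: {status}")
```

Whether continuing after a failed download is the right behavior depends on how Problem 4 is resolved; stopping at the first failure would be an equally reasonable choice.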


aih · 2022-02-19T16:50:18Z

We need mostly recent bills, after the 115th Congress. It is probably easiest to run the usc-run bills function and parse the resulting data.json. This creates more metadata than we need, but it's simple and does have the title data.
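A minimal sketch of that approach, assuming the scraper writes its output as <congress>/bills/<type>/<type><number>/data.json under a data directory and that each data.json carries "bill_id", "congress", and a "titles" list whose entries have "title", "type", "as", and "is_for_portion". The function name iter_bill_titles and the min_congress cutoff (set to 115 here) are hypothetical and should be adjusted to the actual requirement:

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Union


def iter_bill_titles(
    data_root: Union[str, Path], min_congress: int = 115
) -> Iterator[Dict]:
    """Yield one record per title found in the data.json files under data_root."""
    # Assumed layout: <data_root>/<congress>/bills/<type>/<type><number>/data.json
    for path in sorted(Path(data_root).glob("*/bills/*/*/data.json")):
        with path.open(encoding="utf-8") as fh:
            bill = json.load(fh)
        try:
            congress = int(bill.get("congress"))
        except (TypeError, ValueError):
            continue  # skip files without a usable congress number
        if congress < min_congress:
            continue
        for title in bill.get("titles", []):
            # Keep only the title fields; the rest of the metadata is ignored.
            yield {
                "bill_id": bill.get("bill_id"),
                "title": title.get("title"),
                "type": title.get("type"),
                "as": title.get("as"),
                "is_for_portion": title.get("is_for_portion"),
            }


if __name__ == "__main__":
    for record in iter_bill_titles("data"):
        print(record["bill_id"], record["type"], record["title"])
```

Using .get() throughout means a bill with missing fields yields None values instead of raising, which keeps a long indexing run from stopping on one malformed file.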

aih · 2022-02-19T16:52:05Z

We should include the unitedstates/congress scraper as a pip requirement, now that it can be installed as a package.

ninjaguru-git assigned aih and ninjaguru-git Feb 19, 2022

ninjaguru-git added the enhancement, help wanted, and question labels Feb 19, 2022
