This repository contains the code and documentation for a data processing and analysis pipeline focused on legislative bill data from OpenStates. The project categorizes bills from the New York State Assembly session 2022-2023 into distinct subjects and loads the results into a Neo4j graph database.
The data is sourced from OpenStates, focusing on the latest NYS Assembly session. The dataset includes bill identifiers, titles, abstracts, names, and sponsoring legislators. The data was downloaded in bulk to avoid API request limits.
- `fetch_data.py`: Downloads and stores legislative data in `data/raw`.
- `data_preprocessing.py`: Filters data to the specified session and splits it into `NY_Assembly_bills.json` and `NY_Assembly_bill_sponsors.json`.
- `zero_shot_classification.py`: Uses PySpark and Spark NLP to classify bills into predefined topics, using the BERT model for zero-shot classification.
- `neo4j_ingestion.py`: Ingests processed data into Neo4j Aura, a cloud-based graph database for advanced analysis and visualization.
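As a rough illustration of the filter-and-split step, the sketch below keeps bills from one session and separates them into bill rows and sponsor rows. It is not the code in `data_preprocessing.py`: the function names `split_bills` and `write_outputs`, the input field names (`session`, `identifier`, `title`, `abstracts`, `sponsorships`), and the output row layout are all assumptions made for the example.

```python
import json
from pathlib import Path


def split_bills(records, session):
    """Keep bills from one session; return (bill rows, sponsor rows).

    Field names here are assumed for illustration and may differ
    from the actual OpenStates bulk export.
    """
    bills, sponsors = [], []
    for rec in records:
        if rec.get("session") != session:
            continue  # skip bills from other sessions
        abstracts = rec.get("abstracts") or []
        bills.append({
            "identifier": rec["identifier"],
            "title": rec.get("title", ""),
            # use the first abstract if one exists, else an empty string
            "abstract": abstracts[0].get("abstract", "") if abstracts else "",
        })
        for sponsor in rec.get("sponsorships") or []:
            sponsors.append({
                "bill_identifier": rec["identifier"],
                "name": sponsor.get("name", ""),
            })
    return bills, sponsors


def write_outputs(bills, sponsors, out_dir):
    """Write the two JSON files named in the pipeline description."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "NY_Assembly_bills.json").write_text(json.dumps(bills, indent=2))
    (out / "NY_Assembly_bill_sponsors.json").write_text(json.dumps(sponsors, indent=2))
```

Keeping sponsors in a separate file with a `bill_identifier` back-reference makes it straightforward to create one relationship per row later, at the cost of repeating the identifier for bills with several sponsors.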
The Neo4j graph database schema includes:
- Nodes: Bill, Legislator, Subject
- Relationships: IS_SPONSOR (between Legislator and Bill), HAS_SUBJECT (between Bill and Subject)
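One way this schema could be populated is sketched below. The node labels and relationship types come from the schema above; everything else is an assumption for illustration: the property names (`identifier`, `name`), the relationship directions (Legislator to Bill, Bill to Subject), the row layout, and the helper name `ingest`. The `session` argument is expected to behave like a session from the official `neo4j` Python driver, whose `run` method accepts query parameters as keyword arguments.

```python
# Cypher sketches: MERGE avoids duplicate nodes and relationships on re-runs.
SPONSOR_QUERY = """
UNWIND $rows AS row
MERGE (l:Legislator {name: row.name})
MERGE (b:Bill {identifier: row.bill_identifier})
MERGE (l)-[:IS_SPONSOR]->(b)
"""

SUBJECT_QUERY = """
UNWIND $rows AS row
MERGE (b:Bill {identifier: row.bill_identifier})
MERGE (s:Subject {name: row.subject})
MERGE (b)-[:HAS_SUBJECT]->(s)
"""


def ingest(session, sponsor_rows, subject_rows):
    """Send each non-empty batch of rows to the database in one query.

    `session` is any object with a `run(query, **params)` method,
    for example a session opened from a neo4j driver:
        driver = GraphDatabase.driver(uri, auth=(user, password))
        with driver.session() as session: ...
    """
    if sponsor_rows:
        session.run(SPONSOR_QUERY, rows=sponsor_rows)
    if subject_rows:
        session.run(SUBJECT_QUERY, rows=subject_rows)
```

Batching rows through `UNWIND` sends one query per relationship type rather than one per row, which tends to matter when writing to a remote instance such as Aura.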
This project utilizes data from OpenStates and technologies like Apache Spark, Spark NLP, and Neo4j Aura.
- The Apache Software Foundation. (2023). SparkR: R front end for 'Apache Spark'. Apache Spark
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Kocaman, V., & Talby, D. (2021). Spark NLP: natural language understanding at scale. Software Impacts, 8, 100058. DOI: 10.1016/j.simpa.2021.100058.
For more information please contact:
Name | Email |
---|---|
Maria Aroca | [email protected] or [email protected] |