Spark NLP for Bill Classification and Graph-Based Insights with Neo4j

Description

This repository contains the code and documentation for a data processing and analysis pipeline focused on legislative bill data from OpenStates. The project categorizes bills from the 2022-2023 New York State Assembly session into distinct subjects and loads the results into a Neo4j graph database.

Data Source

The data is sourced from OpenStates, focusing on the latest NYS Assembly session. The dataset includes bill identifiers, titles, abstracts, names, and sponsoring legislators. The data was downloaded in bulk to avoid API request limits.

Pipeline Components

fetch_data.py - Downloads legislative data and stores it in data/raw.
data_preprocessing.py - Filters data to the specified session and splits it into NY_Assembly_bills.json and NY_Assembly_bill_sponsors.json.
zero_shot_classification.py - Uses PySpark and Spark NLP to classify bills into predefined topics with a BERT model for zero-shot classification.
neo4j_ingestion.py - Ingests processed data into Neo4j Aura, a cloud-based graph database for advanced analysis and visualization.
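
The filter-and-split step described for data_preprocessing.py can be sketched roughly as below. This is an illustration under assumptions, not the repository's actual code: the function name split_bills, the record fields (session, identifier, title, sponsorships, name), and the sample records are all hypothetical stand-ins for whatever the OpenStates bulk export provides.

```python
def split_bills(records, session):
    """Keep bills from one session and split them into bill rows and sponsor rows.

    All field names used here are assumptions made for illustration.
    """
    bills, sponsors = [], []
    for record in records:
        if record.get("session") != session:
            continue  # skip bills from other sessions
        bills.append({
            "identifier": record["identifier"],
            "title": record.get("title", ""),
        })
        # One sponsor row per (bill, legislator) pair.
        for sponsorship in record.get("sponsorships", []):
            sponsors.append({
                "bill_identifier": record["identifier"],
                "name": sponsorship["name"],
            })
    return bills, sponsors


# Hypothetical sample records, not real OpenStates data.
SAMPLE_RECORDS = [
    {"session": "2022-2023", "identifier": "A 1", "title": "Example bill",
     "sponsorships": [{"name": "Legislator One"}, {"name": "Legislator Two"}]},
    {"session": "2020-2021", "identifier": "A 2", "title": "Older bill",
     "sponsorships": [{"name": "Legislator Three"}]},
]

bills, sponsors = split_bills(SAMPLE_RECORDS, "2022-2023")
print(len(bills), len(sponsors))
```

In this sketch the two returned lists play the roles of NY_Assembly_bills.json and NY_Assembly_bill_sponsors.json; writing them to disk with json.dump would be the remaining step.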

Graph Database Schema

The Neo4j graph database schema includes:

Nodes: Bill, Legislator, Subject
Relationships: IS_SPONSOR (between Legislator and Bill), HAS_SUBJECT (between Bill and Subject)
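
One way the schema above might be populated is with batched Cypher MERGE statements. The sketch below is only an assumption about how that could look: the property names (identifier, name), the relationship directions, and the ingest helper are hypothetical and are not taken from neo4j_ingestion.py.

```python
# Assumed direction: (Legislator)-[:IS_SPONSOR]->(Bill).
SPONSOR_CYPHER = """
UNWIND $rows AS row
MERGE (b:Bill {identifier: row.bill_identifier})
MERGE (l:Legislator {name: row.legislator})
MERGE (l)-[:IS_SPONSOR]->(b)
"""

# Assumed direction: (Bill)-[:HAS_SUBJECT]->(Subject).
SUBJECT_CYPHER = """
UNWIND $rows AS row
MERGE (b:Bill {identifier: row.bill_identifier})
MERGE (s:Subject {name: row.subject})
MERGE (b)-[:HAS_SUBJECT]->(s)
"""


def ingest(run_query, sponsor_rows, subject_rows):
    """Send both batches through run_query(cypher, parameters).

    run_query is any callable with that shape; it is a hypothetical
    hook so the sketch can run without a database connection.
    """
    run_query(SPONSOR_CYPHER, {"rows": sponsor_rows})
    run_query(SUBJECT_CYPHER, {"rows": subject_rows})


# Dry run: record the queries instead of executing them.
recorded = []
ingest(
    lambda cypher, params: recorded.append((cypher, params)),
    [{"bill_identifier": "A 1", "legislator": "Legislator One"}],
    [{"bill_identifier": "A 1", "subject": "Health"}],
)
print(len(recorded))
```

MERGE is used rather than CREATE so that re-running the ingestion does not duplicate nodes or relationships. With the official neo4j Python driver, a session's run method accepts a query string and a parameter dictionary, so it could be passed as run_query.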

Acknowledgements

This project utilizes data from OpenStates and technologies like Apache Spark, Spark NLP, and Neo4j Aura.

The Apache Software Foundation. (2023). SparkR: R front end for 'Apache Spark'. Apache Spark
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Kocaman, V., & Talby, D. (2021). Spark NLP: natural language understanding at scale. Software Impacts, 8, 100058. DOI: 10.1016/j.simpa.2021.100058.

Contact

For more information please contact:

Name	Email
Maria Aroca	[email protected] or [email protected]


mparoca/bill_subject_classification
