📊 Analyse mentions of climate issues in presidential speeches with Python

mattdean1/nlp-president-speeches

nlp-president-speeches

Extract and visualise mentions of climate change issues in presidential speeches (from all time).

Dockerized everything.

brew install make
# see list of available commands
make

# Load the data, start api and frontend client
# (you might want to go grab a ☕️)
# look out for the climate-related sentences being printed to console
make runall

# Inspect the model generation and play around (stop the api containers first)
# Open the link from terminal and go to 'app' directory in the sidebar
make etl-notebook

1. Process + Load data (etl)

  • Read .txt from filesystem
  • Model iteration in Jupyter Notebooks
  • Text processing and classification using NLTK and PyTorch
  • All environments using docker/docker-compose
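As a rough illustration of the "Read .txt from filesystem" step, here is a minimal sketch that walks a directory containing one folder per president and yields each speech. The `Speech` tuple, the `iter_speeches` name, and the idea that the speech title comes from the file name are assumptions for illustration, not the repo's actual code.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Speech(NamedTuple):
    president: str  # assumed to be the folder name
    title: str      # assumed to be the file name without ".txt"
    text: str


def iter_speeches(root: Path) -> Iterator[Speech]:
    """Yield one Speech per .txt file found in each president folder."""
    # Sort for a stable order; anything that is not a directory is skipped.
    for president_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in sorted(president_dir.glob("*.txt")):
            yield Speech(
                president=president_dir.name,
                title=txt_file.stem,
                # Replace undecodable bytes rather than failing on one bad file.
                text=txt_file.read_text(encoding="utf-8", errors="replace"),
            )
```

Each yielded `Speech` could then be inserted into the database before its sentences are classified.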

2. Serve (api)

  • Simple Flask app exposing REST endpoints
  • Integration with VSCode debugger
  • Hot reloading
  • Again all dockerized
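For readers unfamiliar with Flask, a minimal sketch of what such an app can look like follows. The route paths, the `SENTENCES` placeholder list, and the port are assumptions for illustration; the real api reads from the shared Postgres database and its endpoints may differ.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder row standing in for data that would come from the database.
SENTENCES = [
    {
        "id": 1,
        "president": "example_president",
        "speech": "example_speech",
        "text": "We must act on climate change.",
    },
]


@app.route("/sentences")
def list_sentences():
    """Return every stored climate-related sentence as JSON."""
    return jsonify(SENTENCES)


@app.route("/sentences/<int:sentence_id>")
def get_sentence(sentence_id):
    """Return one sentence by id, or a 404 if it does not exist."""
    for row in SENTENCES:
        if row["id"] == sentence_id:
            return jsonify(row)
    return jsonify({"error": "not found"}), 404


if __name__ == "__main__":
    # debug=True enables Flask's reloader, which restarts on code changes.
    app.run(host="0.0.0.0", port=5000, debug=True)
```

Binding to `0.0.0.0` is what makes the server reachable from outside a container.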

3. Visualise (client)

  • React app displaying the data
  • Charting with Victory
  • Display all sentences about climate issues with related metadata (who, when)
  • Highlight sentences in the context of the full speech

How does it work

  • We share the database as a Docker volume (etl_postgres_data) between the ETL and API

  • The etl loads the data as follows:

    • Read each folder of president speeches, insert president and speech into the database
    • Sentences are classified using a pretrained classification model based on RoBERTa, via the Simple Transformers library
    • The model was trained on a corpus of sentences labelled as about climate change or not about climate change
    • Note: we see many sentences being misclassified - e.g. most sentences containing numbers end up in the climate set - which could be mitigated by using larger and more representative training datasets

That model took too long to return predictions :(, so I reverted to the original "classifier": does the sentence contain one or more keywords from a set such as ["climate", "environment", "green"]?
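The keyword check can be sketched in a few lines. This is a minimal version under assumptions: the three keywords are only the examples listed above (the full set is not shown here), the function name is made up, and matching on whole lowercase words is a choice made for this sketch rather than the project's confirmed behaviour.

```python
import re

# Example keywords only; the project's full keyword set is not shown.
KEYWORDS = frozenset({"climate", "environment", "green"})

_WORD_RE = re.compile(r"[a-z]+")


def is_climate_sentence(sentence: str, keywords=KEYWORDS) -> bool:
    """Return True if the sentence contains at least one keyword as a whole word."""
    words = set(_WORD_RE.findall(sentence.lower()))
    return not words.isdisjoint(keywords)
```

Whole-word matching means "greenhouse" does not count as "green", and likewise "environmental" does not count as "environment". A substring check (`keyword in sentence.lower()`) would catch both at the cost of more false positives.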

What's next

  1. Use larger datasets for training/evaluation, speed up prediction by passing in more sentences at once or using a smaller model
  2. Share db models between etl and api
  3. Set up Python linting / static analysis
  • Split president names into firstname/lastname for better sorting
  • Return pre-sorted / precomputed data from extra endpoints -- then we need to do less manipulation on the client
  • Transform / format data when we receive it in the client (get rid of those underscores 🤮)
  • Add (more) tests in js and python
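For item 1, "passing in more sentences at once" could look roughly like the sketch below. The helper names and the default batch size are hypothetical, and `predict_fn` stands in for whatever prediction call is used. With Simple Transformers, `ClassificationModel.predict` accepts a list of texts and returns predictions plus raw outputs, so something like `lambda batch: model.predict(batch)[0]` should fit, but that wiring is untested here.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(
    sentences: Sequence[str],
    predict_fn: Callable[[List[str]], List[int]],  # hypothetical: one label per sentence
    batch_size: int = 64,
) -> List[int]:
    """Call predict_fn once per batch instead of once per sentence."""
    labels: List[int] = []
    for batch in batched(sentences, batch_size):
        labels.extend(predict_fn(batch))
    return labels
```

Whether this helps depends on how much of the slowdown is per-call overhead rather than the model itself; a smaller model may still be needed.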

Sources
