This is a dashboard built with Dash that analyzes shareholder letters from Amazon going back to 1998.
Click here to view the dashboard
View the post on my website about this project here
If you'd like to run my code on your own computer:
- Clone the repo
- Create / activate a virtual environment using requirements.txt
- Start the dashboard with python app.py
- Go to http://127.0.0.1:8050/ in your browser
I've included the scraped pdf files, but if you'd like to rescrape them you can use the getdata.py script in notebooks.
- Download all the shareholder letters as pdfs (using selenium)
- Extract the text out of the pdfs (using pypdf)
- Analyze the text (using nltk and wordninja)
- Display the results (using Dash and Heroku)
I had a few goals with this project:
- Learn more about Dash
- Learn more about sentiment analysis
- Finish an end-to-end analytics project, starting from data collection (my selenium script) to a final deliverable (deployed dashboard)
And oh boy did I learn quite a bit:
- First and foremost, just the syntax and general workings of Dash, plotly, and nltk
- Identifying "sentences" was harder than I anticipated, as I needed to take into account abbreviations ("etc.")
- pypdf would sometimes combine words as each line broke. For example, if the pdf looked like this:
    a test sentence
    more words
  - The text output would end up being: a test sentencemore words
  - I fixed this by using a package called wordninja, which will split the above into: ['a', 'test', 'sentence', 'more', 'words']
- I built a lot of analysis on top of the lemmatized version of the text instead of the raw version, but I didn't quite crack how to map from the lemmatized text back to the raw text
  - For example, if you search for a term on the search page, the concordance table at the bottom will show the lemmatized version of the text instead of the raw version. Definitely an area for improvement.
- I really liked the development experience of Dash. The @callback decorator clicked with me, and I feel way more confident building vizzes with plotly and adding interactivity with Dash.
- If you deploy an app to Heroku and mistakenly name your Procfile something like ProcFile, none of the dynos will get created in Heroku and the web service won't ever start. I needed to DELETE it, commit the deletion, deploy to Heroku, then re-add the Procfile (named correctly) and deploy again.
- My webscraping script isn't perfect. For some reason the css selector didn't pick up the 2007 letter, so I just downloaded that one manually
- How you define "word" / "sentence" / "punctuation" can change the output of your analysis dramatically
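To make the point about sentence definitions concrete, here is a minimal standard-library sketch (not the nltk-based code in this repo) that counts sentences two ways. The abbreviation list and the function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation list for illustration; a real one would be longer.
ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "inc.", "vs."}


def naive_sentences(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_sentences(text):
    """Same split, but glue a piece onto the next one if it ends in an abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_sentences(text):
        candidate = f"{carry} {piece}".strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            carry = candidate  # not a real sentence end, keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


sample = "We sell books, music, etc. to customers. Growth was strong."
print(len(naive_sentences(sample)))               # 3
print(len(abbreviation_aware_sentences(sample)))  # 2
```

The same text yields three "sentences" under one definition and two under the other, which is enough to shift any per-sentence statistic. The abbreviation-aware version has its own blind spot: a sentence that really does end in "etc." gets merged with the one after it.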
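The merged-word problem from the pypdf output can also be illustrated without any third-party packages. The sketch below is a toy dictionary-based splitter, not wordninja's implementation (which, as far as I know, ranks candidate splits by word frequency). The vocabulary and the function name split_words are assumptions made up for this example.

```python
def split_words(text, vocabulary):
    """Return the split of `text` into the fewest vocabulary words, or None."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] holds the fewest-word split of text[:i]
    best[0] = []
    max_len = max(map(len, vocabulary))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy vocabulary (an assumption); a real splitter needs a large word list.
VOCAB = {"a", "test", "sentence", "more", "words"}

merged = "a test sentencemore words"
repaired = [
    word
    for token in merged.split()
    # Fall back to the original token when no split is found.
    for word in (split_words(token, VOCAB) or [token])
]
print(repaired)  # ['a', 'test', 'sentence', 'more', 'words']
```

Preferring the fewest words is a crude tie-breaker. With a realistic vocabulary, a frequency-based score does a much better job of avoiding splits into many tiny words.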