Added additional data collection capabilities and fixed bugs in scraper.py #27

Open: diehl wants to merge 1 commit into master

@diehl diehl commented Jul 14, 2020

Additional data elements that are now collected per post:

  • Post creator
  • Post creation datetime
  • Post like count
  • Post share count
    -- Previously collected inconsistently as a string. Now collected reliably as an integer.
  • Post comment count
  • Complete post text
    -- Previously, when a FB user shared a post and added text of their own while sharing, that added text was lost. This is now fixed.
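
To illustrate the string-to-integer change for counts, here is a minimal sketch. It is not the code in scraper.py: the function name `parse_count` and the display formats it handles ("1.2K Shares", "1,234 shares") are assumptions about how a count might appear in the page text.

```python
import re

# Hypothetical suffix multipliers; the actual page format is an assumption.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

# A number with optional thousands commas and decimal part,
# optionally followed directly by a K or M suffix.
_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)([KkMm])?\b")


def parse_count(text: str) -> int:
    """Convert a displayed count such as '1.2K Shares' into an integer.

    Returns 0 when no number is found, so callers always get an int.
    """
    match = _COUNT_RE.search(text)
    if match is None:
        return 0
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return int(round(number * _MULTIPLIERS[suffix]))
```

Returning an integer in every case means downstream code can sum or compare counts without checking the type first.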

Fixed a bug in the collection of comment threads. The previous implementation saved comment text in dictionaries keyed by the comment author, so content was dropped whenever the same FB user posted more than once in a comment thread.
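
The failure mode can be reproduced in a few lines. This is a simplified sketch, not the scraper's actual data structures: the sample comments and the function names `collect_by_author` and `collect_in_order` are made up for illustration.

```python
# Hypothetical (author, text) pairs in thread order.
comments = [
    ("alice", "First!"),
    ("bob", "Nice post"),
    ("alice", "Following up on my earlier comment"),
]


def collect_by_author(pairs):
    """Old approach: key by author, so a repeat author overwrites earlier text."""
    by_author = {}
    for author, text in pairs:
        by_author[author] = text
    return by_author


def collect_in_order(pairs):
    """One record per comment, kept in thread order; nothing is dropped."""
    return [{"author": author, "text": text} for author, text in pairs]
```

With the sample data, `collect_by_author(comments)` holds two entries and alice's first comment is gone, while `collect_in_order(comments)` keeps all three.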

The code has also been refactored so that the scraped page contents can be read from disk and parsed separately. The contents are saved to disk before parsing, so if an error occurs downstream the saved file is still available for debugging.
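
A rough sketch of the save-then-parse flow follows. The helper names (`save_page_source`, `load_page_source`, `parse_page`) and the file name are hypothetical, and the parsing step is a stand-in built on the standard library's `html.parser`, not the parser used in scraper.py.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List


def save_page_source(page_source: str, path: Path) -> None:
    """Write the raw page source to disk as bytes before any parsing happens."""
    path.write_bytes(page_source.encode("utf-8"))


def load_page_source(path: Path) -> str:
    """Read a previously saved page source back from disk."""
    return path.read_bytes().decode("utf-8")


class _TextCollector(HTMLParser):
    """Stand-in parser: collects the non-empty text chunks of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def parse_page(page_source: str) -> List[str]:
    """Parse page source that may come from a live scrape or from disk."""
    collector = _TextCollector()
    collector.feed(page_source)
    collector.close()
    return collector.chunks
```

Because `parse_page` only takes a string, a failed parse can be rerun against the saved file, for example `parse_page(load_page_source(Path("page_source.bin")))`, without scraping the page again.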

…ments, number of shares, number of likes, and the full comment history. Also added the capability to parse the html separately after acquiring the page source which is now written to a binary file.