Added additional data collection capabilities and fixed bugs in scraper.py #27

diehl · 2020-07-14T18:40:52Z

Additional data elements that are now collected per post:

Post creator
Post creation datetime
Post like count
Post share count
-- Previously collected inconsistently as a string. Now collected reliably as an integer.
Post comment count
Complete post text
-- Previously, when an FB user shared a post and added text of their own while sharing, that added text was lost. This is now fixed.
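
As an illustration of the string-to-integer change for counts, here is a minimal sketch. It is not the code from scraper.py: the function name `parse_count` is hypothetical, and the input formats it handles (thousands separators, a trailing label such as "Shares", and `K`/`M` abbreviations) are assumptions about what the page might show.

```python
import re

# Assumed abbreviations; the real page may use others.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(raw: str) -> int:
    """Convert a scraped count string such as '1.2K' or '3,456 Shares' to an int.

    Hypothetical helper: returns 0 when no leading number is found.
    """
    text = raw.strip().replace(",", "")
    match = re.match(r"(\d+(?:\.\d+)?)([KkMm]?)", text)
    if match is None:
        return 0
    number, suffix = match.groups()
    multiplier = _SUFFIXES.get(suffix.upper(), 1)
    return int(round(float(number) * multiplier))
```

Returning an integer in every case means downstream code can sort or sum the counts without checking the type first.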

Fixed a bug in the collection of comment threads. In the previous implementation, comment text was saved in dictionaries keyed by the comment author. As a result, content was dropped whenever the same FB user posted more than once in a comment thread, because each new comment overwrote that user's earlier one.
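
The failure mode described above can be reproduced in a few lines. This is only a sketch of the general pattern, not the actual code from scraper.py; the function names and the `(author, text)` tuple layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Comment = Tuple[str, str]  # (author, text); hypothetical layout


def collect_by_author(comments: List[Comment]) -> Dict[str, str]:
    """Old pattern: keyed by author, so a repeat author overwrites earlier text."""
    thread: Dict[str, str] = {}
    for author, text in comments:
        thread[author] = text
    return thread


def collect_as_list(comments: List[Comment]) -> List[Dict[str, str]]:
    """One record per comment, in order, so repeat authors are preserved."""
    return [{"author": author, "text": text} for author, text in comments]


scraped = [("Alice", "First"), ("Bob", "Reply"), ("Alice", "Second")]
lossy = collect_by_author(scraped)    # Alice's first comment is gone
complete = collect_as_list(scraped)   # all three comments are kept
```

Storing one record per comment also keeps the thread order, which a dictionary keyed by author cannot represent.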

The code has also been refactored so that the scraped page contents can be read back from disk and parsed separately. The scraped contents are saved to disk before parsing, so if an error occurs downstream the saved file is still available for debugging.
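
A save-then-parse flow of this kind might look like the sketch below. It uses only the standard library and is not the implementation in scraper.py: the function names, the UTF-8 encoding, and the simple text-extracting parser are all assumptions, and the page source is assumed to arrive as a string.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List


def save_page_source(page_source: str, path: Path) -> None:
    """Write the raw page source to a binary file before any parsing happens."""
    path.write_bytes(page_source.encode("utf-8"))


def load_page_source(path: Path) -> str:
    """Read a previously saved page source back from disk."""
    return path.read_bytes().decode("utf-8")


class TextExtractor(HTMLParser):
    """Placeholder parser: collects non-empty text nodes in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def parse_saved_page(path: Path) -> List[str]:
    """Parse from the file on disk, so parsing can be rerun without re-scraping."""
    parser = TextExtractor()
    parser.feed(load_page_source(path))
    parser.close()
    return parser.chunks
```

Because parsing reads only from the saved file, a parser bug can be fixed and the parse rerun against the same input without fetching the page again.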

Added code to scraper.py to save the datetime, creator, number of comments, number of shares, number of likes, and the full comment history. Also added the capability to parse the html separately after acquiring the page source which is now written to a binary file.

393b5d7
