Added display_crawl_results #20

vringar · 2020-06-15T09:58:59Z

This function should give the user a general overview of the crawl_history and of what kind of data loss to expect.
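As a rough illustration of the kind of overview meant here, the following is a minimal sketch in plain Python rather than PySpark, so that it runs without a cluster. The function name `summarize_command_status` and the toy rows are hypothetical and are not the code in this PR; only the `command_status` column name comes from the discussion.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_command_status(crawl_history: Iterable[Dict]) -> Dict[str, float]:
    """Return the share of commands per command_status value."""
    counts = Counter(row["command_status"] for row in crawl_history)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: n / total for status, n in counts.items()}


# Toy rows standing in for the crawl_history table (assumed layout).
rows = [
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 2, "command_status": "ok"},
    {"visit_id": 3, "command_status": "timeout"},
    {"visit_id": 4, "command_status": "ok"},
]
shares = summarize_command_status(rows)
for status, share in sorted(shares.items()):
    print(f"{status}: {share:.0%}")
```

Anything other than "ok" in such a summary hints at visits whose data may be incomplete.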

openwpm_utils/crawlhistory.py

englehardt · 2020-07-01T00:58:39Z

@vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.


vringar · 2020-07-06T13:37:10Z

> @vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.

This function is used in the dataquality notebook on Databricks.

Downloading files via the SparkContext was much slower than downloading via boto (which is what S3Dataset does). So now both classes use the same method, as PySparkS3Dataset inherits from S3Dataset.

This parameter allows for filtering out VisitIds that are part of `incompleted_visits` or that had a command with a command_status other than "ok", since users probably shouldn't consider them for analysis. This filtering functionality is extracted into the TableFilter class so that other Datasets can reuse it.
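The filtering rule described above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the TableFilter implementation from this PR: the helper names `failed_visit_ids` and `filter_table`, the `visit_id` column name, and the toy tables are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def failed_visit_ids(
    incompleted_visits: Iterable[Dict], crawl_history: Iterable[Dict]
) -> Set[int]:
    """Collect visit_ids that are incomplete or had a non-"ok" command."""
    bad = {row["visit_id"] for row in incompleted_visits}
    bad |= {
        row["visit_id"]
        for row in crawl_history
        if row["command_status"] != "ok"
    }
    return bad


def filter_table(table: Iterable[Dict], bad_ids: Set[int]) -> List[Dict]:
    """Keep only rows whose visit_id is not in bad_ids."""
    return [row for row in table if row["visit_id"] not in bad_ids]


# Toy data standing in for the real tables (assumed layout).
incompleted_visits = [{"visit_id": 2}]
crawl_history = [
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 2, "command_status": "ok"},
    {"visit_id": 3, "command_status": "timeout"},
]
http_requests = [
    {"visit_id": 1, "url": "https://example.com/a"},
    {"visit_id": 2, "url": "https://example.com/b"},
    {"visit_id": 3, "url": "https://example.com/c"},
]

bad_ids = failed_visit_ids(incompleted_visits, crawl_history)
clean = filter_table(http_requests, bad_ids)
```

Computing the set of bad ids once and reusing it for every table is what makes a shared filter class attractive: each dataset only needs to apply the same exclusion.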

vringar force-pushed the display_crawl_history branch from 1b2bee5 to 8e88f65 · June 15, 2020 10:23

englehardt reviewed Jun 30, 2020


openwpm_utils/crawlhistory.py (Outdated)

englehardt suggested changes Jul 1, 2020


vringar force-pushed the display_crawl_history branch 2 times, most recently from 3220d0f to d8eb4bb · July 20, 2020 14:38

vringar changed the base branch from load_table_enhancement to master · April 9, 2021 10:27

Stefan Zabka and others added 2 commits April 9, 2021 12:38

Removed collect_content from PySparkS3Dataset

33bb9a2

Downloading files via the SparkContext was much slower than downloading via boto (which is what S3Dataset does). So now both classes use the same method, as PySparkS3Dataset inherits from S3Dataset.
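The effect of this commit can be pictured with a small sketch. The classes below are hypothetical stand-ins, not the real S3Dataset and PySparkS3Dataset; the injected `fetch` callable replaces the boto-backed download so the example runs offline, and the `content/<hash>` key layout is an assumption.

```python
from typing import Callable, Dict


class S3DatasetSketch:
    """Hypothetical stand-in for S3Dataset: fetches content by key."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # `fetch` stands in for a boto-backed download (assumption).
        self._fetch = fetch

    def collect_content(self, content_hash: str) -> bytes:
        # The "content/<hash>" key layout is assumed for illustration.
        return self._fetch(f"content/{content_hash}")


class PySparkS3DatasetSketch(S3DatasetSketch):
    """Hypothetical stand-in for PySparkS3Dataset.

    It defines no collect_content of its own, so the inherited one is used.
    """


# An in-memory dict plays the role of the bucket.
store: Dict[str, bytes] = {"content/abc123": b"<html></html>"}
dataset = PySparkS3DatasetSketch(store.__getitem__)
body = dataset.collect_content("abc123")
```

Removing the subclass override leaves a single download path to maintain, which is the design choice the commit message describes.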

vringar force-pushed the display_crawl_history branch 2 times, most recently from cb8a25f to 00a3d47 · April 9, 2021 11:57

vringar and others added 7 commits April 9, 2021 14:02

Added display_crawl_results

c43cee8

Rewrote crawlhistory.py

5925ac9

Used typeannotations

5639239

Fixing display_crawl_history

2312e0e

Added docstrings

3deea98

Added demo file

cb19511

Backporting from next

247adea

vringar force-pushed the display_crawl_history branch from 00a3d47 to 247adea · April 9, 2021 12:10

vringar requested a review from englehardt April 9, 2021 12:10
