Added display_crawl_results #20

Open: wants to merge 9 commits into base: master

Conversation

@vringar (Contributor) commented Jun 15, 2020

This function should give the user a general overview of the crawl_history and of what kind of data loss to expect.
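
For readers without access to the notebook, here is a minimal sketch of the kind of overview described above. It assumes each `crawl_history` row exposes `command` and `command_status` fields. The function name `summarize_crawl_history` and the keys of the returned dictionary are hypothetical, and this is not the code added by this PR.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_crawl_history(rows: Iterable[Dict[str, str]]) -> dict:
    """Count commands per (command, command_status) pair and report
    the fraction of commands that did not finish with status "ok".

    Hypothetical helper: field names are assumptions, not the PR's API.
    """
    counts = Counter((row["command"], row["command_status"]) for row in rows)
    total = sum(counts.values())
    failed = sum(n for (_, status), n in counts.items() if status != "ok")
    return {
        "counts": dict(counts),
        "total": total,
        # Guard against an empty crawl_history to avoid division by zero.
        "failed_fraction": failed / total if total else 0.0,
    }
```

A high `failed_fraction` would be one rough indicator of how much data loss to expect in the other tables.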

@englehardt (Contributor) commented

@vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.

@englehardt (Contributor) left a comment

.

@vringar (Contributor, Author) commented Jul 6, 2020

> @vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.

This function is used in the dataquality notebook on Databricks.

@vringar vringar force-pushed the display_crawl_history branch 2 times, most recently from 3220d0f to d8eb4bb Compare July 20, 2020 14:38
@vringar vringar changed the base branch from load_table_enhancement to master April 9, 2021 10:27
Stefan Zabka and others added 2 commits April 9, 2021 12:38
Downloading files via the SparkContext was much slower than
downloading via boto (which is what S3Dataset does).
So now both classes use the same method, as PySparkS3Dataset
inherits from S3Dataset.

This parameter allows for filtering out VisitIds that are part of
`incompleted_visits` or that had a command with a command_status other than
"ok", since users probably shouldn't consider them for analysis.

This filtering functionality is extracted into the TableFilter class to
be reused by other Datasets.
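
The filtering described in this commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `TableFilter` class: it assumes rows are dictionaries carrying a `visit_id` key and that command rows carry `command_status`, and the helper names `bad_visit_ids` and `clean_table` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]


def bad_visit_ids(incomplete_visits: Iterable[Row],
                  crawl_history: Iterable[Row]) -> Set[object]:
    """Collect visit_ids that are listed as incomplete or that had at
    least one command whose command_status is not "ok"."""
    bad = {row["visit_id"] for row in incomplete_visits}
    bad |= {
        row["visit_id"]
        for row in crawl_history
        if row["command_status"] != "ok"
    }
    return bad


def clean_table(table: Iterable[Row],
                incomplete_visits: Iterable[Row],
                crawl_history: Iterable[Row]) -> List[Row]:
    """Return only the rows whose visit_id is not in the bad set."""
    bad = bad_visit_ids(incomplete_visits, crawl_history)
    return [row for row in table if row["visit_id"] not in bad]
```

Computing the set of bad visit ids once and reusing it for every table is what makes it natural to pull the logic into a shared class, as the commit message describes.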
@vringar vringar force-pushed the display_crawl_history branch 2 times, most recently from cb8a25f to 00a3d47 Compare April 9, 2021 11:57