Added display_crawl_results #20
1b2bee5 to 8e88f65
@vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.
This function is used in the dataquality notebook on Databricks.
3220d0f to d8eb4bb
Downloading files via the SparkContext was much slower than downloading via boto (which is what S3Dataset does). Both classes now use the same method, since PySparkS3Dataset inherits from S3Dataset.
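A minimal sketch of the inheritance arrangement described here, for illustration only. The class names carry a `Sketch` suffix because they are hypothetical stand-ins, not the actual `S3Dataset` or `PySparkS3Dataset` code; the `fetch` callable and the `get_file` method name are assumptions, with `fetch` standing in for whatever boto call the real base class makes.

```python
class S3DatasetSketch:
    """Hypothetical stand-in for S3Dataset: one download path for all files."""

    def __init__(self, fetch):
        # `fetch` is an assumed callable mapping an object key to bytes.
        # In a real dataset class this would wrap a boto download.
        self._fetch = fetch

    def get_file(self, key):
        # Single place where file downloads happen.
        return self._fetch(key)


class PySparkS3DatasetSketch(S3DatasetSketch):
    """Hypothetical stand-in for PySparkS3Dataset.

    It does not override get_file, so file downloads go through the
    inherited (boto-style) path rather than through the SparkContext.
    """

    def __init__(self, fetch, spark_context=None):
        super().__init__(fetch)
        # The Spark context is kept for table reads only (assumption).
        self.spark_context = spark_context


# Usage with an in-memory dict playing the role of the bucket.
fake_bucket = {"content/abc": b"hello"}
dataset = PySparkS3DatasetSketch(fake_bucket.__getitem__)
content = dataset.get_file("content/abc")
```

Because the subclass leaves `get_file` untouched, any speed-up in the base class download applies to both classes.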
This parameter allows filtering out VisitIds that are part of `incompleted_visits` or that had a command with a command_status other than "ok", since users probably shouldn't consider them for analysis. The filtering functionality is extracted into the TableFilter class so that other Datasets can reuse it.
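The filtering rule can be sketched in plain Python as follows. This is not the actual `TableFilter` implementation: the class name `TableFilterSketch`, the `clean` method, the `visit_id` key, and the use of lists of dicts instead of DataFrames are all assumptions made to keep the example self-contained.

```python
class TableFilterSketch:
    """Hypothetical illustration of the visit-filtering rule."""

    def __init__(self, incomplete_visit_ids, crawl_history):
        # Visits with at least one command whose status is not "ok".
        failed = {
            row["visit_id"]
            for row in crawl_history
            if row["command_status"] != "ok"
        }
        # A visit is excluded if it is incomplete or had a failed command.
        self.excluded = set(incomplete_visit_ids) | failed

    def clean(self, rows):
        # Keep only rows that belong to visits not marked for exclusion.
        return [row for row in rows if row["visit_id"] not in self.excluded]


# Usage: visit 2 is incomplete, visit 3 had a timed-out command.
crawl_history = [
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 3, "command_status": "ok"},
    {"visit_id": 3, "command_status": "timeout"},
]
table_filter = TableFilterSketch(incomplete_visit_ids=[2], crawl_history=crawl_history)
http_requests = [{"visit_id": v, "url": f"https://example.com/{v}"} for v in (1, 2, 3)]
kept = table_filter.clean(http_requests)
```

Keeping the exclusion set in its own object means any dataset class can apply the same rule to any table that carries a visit identifier.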
cb8a25f to 00a3d47
00a3d47 to 247adea
This function should give the user a general overview of the crawl_history and of what kind of data loss to expect.
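One way such an overview could be computed is to count commands per `command_status` and report each status as a share of the total. The sketch below is an assumption about the general idea, not the actual `display_crawl_results` code; the function name `summarize_crawl_history` and the list-of-dicts input are hypothetical.

```python
from collections import Counter


def summarize_crawl_history(crawl_history):
    """Return {status: (count, share_of_total)} for the given rows.

    Hypothetical helper: `crawl_history` is assumed to be an iterable of
    dicts that each carry a "command_status" key.
    """
    counts = Counter(row["command_status"] for row in crawl_history)
    total = sum(counts.values())
    # An empty history yields an empty summary, so no division by zero occurs.
    return {status: (n, n / total) for status, n in counts.items()}


# Usage: three successful commands and one timeout.
history = [{"command_status": "ok"}] * 3 + [{"command_status": "timeout"}]
summary = summarize_crawl_history(history)
for status, (count, share) in sorted(summary.items()):
    print(f"{status}: {count} commands ({share:.0%})")
```

The share of non-"ok" statuses gives a rough first estimate of how much data might be missing or unreliable.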