feat(Azure): multiprocessing #53
base: worker-pool-v4.0
Conversation
for provider_setting in provider_settings.values():
    # this is so confusing - plural settings to setting?
Maybe for azure_settings in provider_settings.values()? Or for azure_entry in provider_settings.values()? Or for provider_entry in provider_settings.values()?
I think changing the context property name from provider_settings to provider_setting would make the most sense. It's not a big deal, but I tripped on it a couple of times. AWS and GCP are probably doing the same.
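A minimal sketch of the rename being discussed, assuming a per-process scan context object; the class and field names are illustrative, not the connector's actual API:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AzureScanContext:
    # Hypothetical per-process context: it carries exactly one settings
    # entry, so the singular name reads naturally inside the worker.
    provider_setting: Any  # renamed from provider_settings


def enqueue_scans(provider_settings: dict[str, Any]) -> None:
    # The plural name stays on the top-level dict; each value becomes the
    # singular entry handed to one worker process.
    for provider_setting in provider_settings.values():
        scan_context = AzureScanContext(provider_setting=provider_setting)
        # hand scan_context off to the worker pool here (submission omitted)
```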
# TODO: figure out how to make this wait until scan is finished:
**After writing this, #2 is definitely better and is a fine solution. Figured I would leave my thought process, though.**
I don't think this solution is compatible with multiprocessing. Since we don't scan by region, but region is part of the label, the options I see are:

1. Set a TTL on the asset and only submit what we find.
   Pros: easy.
   Cons: could leave seeds "stale" for up to 48 hours (it's probably ~24 hours now, depending on attribution).
2. One label will correspond to one process in this case, so we should be able to move the submission of empty payloads to labels_not_found within the single process. In this case it would be right after super().scan(**kwargs) in self.scan(). If we split up seeds/assets like we talked about for aws/gcp:

       scan_all
           scan_seeds()
               super.scan_seeds()
                   get_seeds()
           scan_cloud_assets()
               super.scan_cloud_assets()
                   get_cloud_assets()

   then it would be at the end of scan_seeds(), after it calls super.scan_seeds(), where "it" is:

       if self.scan_all_regions:
           for label_not_found in scan_context.possible_labels:
               self.delete_seeds_by_label(label_not_found)
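A hedged sketch of where that cleanup could sit, assuming the scan_seeds()/scan_cloud_assets() split described above; the class and helpers below are stand-ins, not the real connector:

```python
class AzureCloudConnectorSketch:
    """Illustrative stand-in for the Azure connector, not the real class."""

    def __init__(self, scan_all_regions: bool):
        self.scan_all_regions = scan_all_regions

    def scan_seeds(self, scan_context) -> None:
        # ... in the real connector this is where super().scan_seeds(**kwargs)
        # would gather and submit the seeds found by this process ...

        # Proposed placement: after the super() call returns, delete stale
        # seeds for every label this process expected but never found, so
        # the cleanup stays inside the single worker process.
        if self.scan_all_regions:
            for label_not_found in scan_context.possible_labels:
                self.delete_seeds_by_label(label_not_found)

    def delete_seeds_by_label(self, label: str) -> None:
        # Stand-in for submitting an empty payload that clears a label.
        print(f"deleting seeds for label: {label}")
```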
Let's talk about this one some more.
My concern is that we will further split up how the connector works over time. I don't know that we can rely on an account or any level of region being in a single process over time. I could see each resource type being spawned off as well.
Another strategy could be to keep track of a "run id", or create a single value for the entire run. Upon completion, we could instruct that items which aren't part of the current run be removed (see the sketch below).
The cloud provider change streams might be a better way to handle this prune routine as well.
It will be hard to manage our concurrency patterns with features like this over time. With large enough customers we might not be able to hold all of the possible labels in RAM either. I frequently saw out-of-memory (OOM) errors due to patterns like this while trying to build ingestion pipelines, and those instances had something like 16 GB of RAM.
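A rough sketch of the run-id strategy, assuming seeds are tagged at submission time and pruned once the run completes; every name here is illustrative:

```python
import uuid


class SeedClientSketch:
    """Illustrative stand-in for the seed submission API."""

    def submit_seeds(self, label: str, run_id: str) -> None:
        print(f"submit seeds label={label} run_id={run_id}")

    def prune_other_runs(self, run_id: str) -> None:
        # Server-side prune: anything not tagged with the current run id
        # is assumed stale and removed.
        print(f"prune everything where run_id != {run_id}")


def run_scan(client: SeedClientSketch, labels: list[str]) -> None:
    run_id = str(uuid.uuid4())  # one value shared by the entire run
    for label in labels:
        client.submit_seeds(label=label, run_id=run_id)
    # No worker has to hold every possible label in RAM; the prune is
    # keyed off the run id after the whole run finishes.
    client.prune_other_runs(run_id=run_id)
```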
self.logger.info(
-    f"Scanning AWS account {self.account_number} in region {self.region}"
+    f"Scanning AWS - account:{scan_context.account_number} region:{scan_context.region}"
maybe specify "scanning for seeds" here? otherwise we'll have the same log twice and not know which was for seeds vs cloud assets. not sure if that's a big deal
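A small sketch of the disambiguation being suggested; the attribute names mirror the quoted diff, everything else is illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def log_scan_start(scan_context, phase: str) -> None:
    # phase is "seeds" or "cloud assets", so the two otherwise-identical
    # log lines stay distinguishable.
    logger.info(
        "Scanning AWS %s - account:%s region:%s",
        phase,
        scan_context.account_number,
        scan_context.region,
    )
```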
- credential passing
- possible labels + delete seeds by label
- cloud asset use label prefix
- healthcheck log errors if dry run enabled
Add multiprocessing support to Azure provider.
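Since the change is about moving scans into worker processes, here is a minimal sketch of that shape using only the standard library; the pool setup and function names are assumptions, not the PR's actual implementation:

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Any


def scan_one(provider_setting: Any) -> str:
    # Stand-in for scanning a single Azure provider setting inside its own
    # worker process.
    return f"scanned {provider_setting}"


def scan_all(provider_settings: dict[str, Any]) -> list[str]:
    # Each settings entry becomes one task in the worker pool, which is the
    # general shape the multiprocessing support implies.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(scan_one, provider_settings.values()))
```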
This PR should be merged into branch worker-pool-v4.0 when ready.