Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat(ingest/ldap)fix list index out of range error #8525

Conversation

alplatonov
Copy link
Contributor

@alplatonov alplatonov commented Jul 28, 2023

issue #8523

Checklist

  • [+] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [+] Links to related issues (if applicable)
  • [+] Tests for the changes have been added/updated (if applicable)
  • [+] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [+] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Provided changes:
– Some fields may not be filled in, now there will be a check via dict.get("key")
– In some cases, the LDAP provider does not have pagination functionality, and the ingest fails, although all data was loaded successfully. Now possible to use the pagination_enabled option to work with LDAP providers.
– In LDAP ingestor, the manager_pagination_enabled changed to general pagination_enabled. The old option manager_pagination_enabled now provides a warning massage.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 28, 2023
description="Use pagination while search for managers (enabled by default).",
description="[deprecated] Use pagination_enabled ",
)
_deprecate_manager_pagination_enabled = pydantic_field_deprecated(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required ? There is also manager_filter_enabled. Does it also need renaming ?

Suggestion : Use pydantic_renamed_field instead. No need to write your own validator in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required ?

LDAP service like database, we can get error "SIZELIMIT_EXCEEDED: {'desc': 'Size limit exceeded'}"
This error tells us that we reached the limit of search result entries (1000 by default). There are 2 ways of resolving this error:

  • Go to your LDAP DATABASE and increase the maximum page size (MaxPageSize) option.
  • Use result paginating in your Python code.

Also, LDAP DATABASE sometimes doesn't support pagination at all. For example, google workspace LDAP doesn't support it -> so we need the option to turn off pagination completely, or we get an error.

Sometimes finding a person manager provides unexpected data and can be reduced with pagination.

There is also manager_filter_enabled. Does it also need renaming?

No, this option is responsible for additional manager discovery. We want that person's head will be founded or not.

Suggestion : Use pydantic_renamed_field instead. No need to write your own validator in that case.

Thx, I change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. My question was around - why this rename is required - manager_pagination_enabled->pagination_enabled. Does this rename also warrant a rename in manager_filter_enabled->filter_enabled ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old code/way, we always setup page control with line self.lc = create_controls(self.config.page_size).
Sometimes LDAP provider doesn't support pagination. That's why i need an option – enable pagination or not in new code.

Now I have a decision:

  1. create a new variable, for example pagination_enabled and save manager_pagination_enabled
  2. rename manager_pagination_enabled to pagination_enabled

I choose option 2) because I don't see a situation where we need pagination for manager discovery, and at the same time, we need a global user/group pagination. If we have a problem and pagination is needed for the user/group or head, we can enable or disable it globally. The user will need clarification, as I am, with the name manager_pagination_enabled and global enable/disable pagination.

self.report.report_failure(
"ldap-control", "Server ignores RFC 2696 control."
)
if serverctrls:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the source report failure if serverctrls is not truthy ? When is serverctls not truthy ?

Copy link
Contributor Author

@alplatonov alplatonov Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the source report failure if serverctrls is not truth ?

No, only if enabled pagecontrol. i fix it.

When is serverctls not truth?

In some scenarios, clients might want to check if the server supports specific controls without actually using them. To do this, they can send an empty server control in the request and observe the server's response. If the server supports the control, it will be included in the response.

Changed on self.lc

metadata-ingestion/src/datahub/ingestion/source/ldap.py Outdated Show resolved Hide resolved
# - attribute exists
# - attribute is not empty list
if attrs.get(self.config.user_attrs_map["email"]):
email = (attrs[self.config.user_attrs_map["email"]][0]).decode()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this logic be extracted into a method and use the method - say - get_attr_or_none everywhere. looks like a lot of repetition here..

def get_attr_or_none(attrs, key, default=None):
    return attrs[self.config.user_attrs_map[key]][0].decode() if attrs.get(self.config.user_attrs_map[key]) else default

also is it assumed that attrs[self.config.user_attrs_map[key]] is instance of list ? Is is always true ? May help to add a check here to avoid errors.

else None
)

if attrs.get(self.config.group_attrs_map["email"]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

@mayurinehate mayurinehate added the community-contribution PR or Issue raised by member(s) of DataHub Community label Aug 4, 2023
Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@asikowitz asikowitz merged commit b58f9bb into datahub-project:master Aug 9, 2023
43 checks passed
yoonhyejin pushed a commit that referenced this pull request Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants