Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat(ingest/ldap)fix list index out of range error #8525

Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/how/updating-datahub.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ individually enable / disable desired field metrics.
### Deprecations

- #8198: In the Python SDK, the `PlatformKey` class has been renamed to `ContainerKey`.
- #8525: In LDAP ingestor, the `manager_pagination_enabled` changed to general `pagination_enabled`

### Other notable Changes

Expand Down
159 changes: 98 additions & 61 deletions metadata-ingestion/src/datahub/ingestion/source/ldap.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,11 @@

import ldap
from ldap.controls import SimplePagedResultsControl
from pydantic.class_validators import root_validator
from pydantic.fields import Field

from datahub.configuration.common import ConfigurationError
from datahub.configuration.pydantic_field_deprecation import pydantic_field_deprecated
from datahub.configuration.source_common import DatasetSourceConfigMixin
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.decorators import (
Expand All @@ -16,7 +18,7 @@
support_status,
)
from datahub.ingestion.api.source import MetadataWorkUnitProcessor
from datahub.ingestion.api.workunit import MetadataWorkUnit
from datahub.ingestion.api.workunit import MetadataWorkUnit, logger
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalHandler,
StaleEntityRemovalSourceReport,
Expand Down Expand Up @@ -135,13 +137,41 @@ class LDAPSourceConfig(StatefulIngestionConfigBase, DatasetSourceConfigMixin):

manager_pagination_enabled: bool = Field(
default=True,
description="Use pagination while search for managers (enabled by default).",
description="[deprecated] Use pagination_enabled ",
)
_deprecate_manager_pagination_enabled = pydantic_field_deprecated(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required ? There is also manager_filter_enabled. Does it also need renaming ?

Suggestion : Use pydantic_renamed_field instead. No need to write your own validator in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required ?

LDAP service like database, we can get error "SIZELIMIT_EXCEEDED: {'desc': 'Size limit exceeded'}"
This error tells us that we reached the limit of search result entries (1000 by default). There are 2 ways of resolving this error:

  • Go to your LDAP DATABASE and increase the maximum page size (MaxPageSize) option.
  • Use result paginating in your Python code.

Also, LDAP DATABASE sometimes doesn't support pagination at all. For example, google workspace LDAP doesn't support it -> so we need the option to turn off pagination completely, or we get an error.

Sometimes finding a person manager provides unexpected data and can be reduced with pagination.

There is also manager_filter_enabled. Does it also need renaming?

No, this option is responsible for additional manager discovery. We want that person's head will be founded or not.

Suggestion : Use pydantic_renamed_field instead. No need to write your own validator in that case.

Thx, I change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. My question was around - why this rename is required - manager_pagination_enabled->pagination_enabled. Does this rename also warrant a rename in manager_filter_enabled->filter_enabled ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old code/way, we always setup page control with line self.lc = create_controls(self.config.page_size).
Sometimes LDAP provider doesn't support pagination. That's why i need an option – enable pagination or not in new code.

Now I have a decision:

  1. create a new variable, for example pagination_enabled and save manager_pagination_enabled
  2. rename manager_pagination_enabled to pagination_enabled

I choose option 2) because I don't see a situation where we need pagination for manager discovery, and at the same time, we need a global user/group pagination. If we have a problem and pagination is needed for the user/group or head, we can enable or disable it globally. The user will need clarification, as I am, with the name manager_pagination_enabled and global enable/disable pagination.

"manager_pagination_enabled"
)
pagination_enabled: bool = Field(
default=True,
description="Use pagination while do search query (enabled by default).",
)

# default mapping for attrs
user_attrs_map: Dict[str, Any] = {}
group_attrs_map: Dict[str, Any] = {}

# pre = True because we want to take some decision before pydantic initialize the configuration to default values
@root_validator(pre=True)
def pagination_backward_compatibility(cls, values: Dict) -> Dict:
manager_pagination_enabled = values.get("manager_pagination_enabled")
pagination_enabled = values.get("pagination_enabled")
if pagination_enabled is None and manager_pagination_enabled:
logger.warning(
"pagination_enabled is not set but manager_pagination_enabled is set. manager_pagination_enabled is "
"deprecated, please use pagination_enabled instead."
)
logger.info(
"Initializing pagination_enabled from manager_pagination_enabled"
)
values["pagination_enabled"] = manager_pagination_enabled
elif manager_pagination_enabled and pagination_enabled:
raise ValueError(
"manager_pagination_enabled is deprecated. Please use pagination_enabled only."
)

return values


@dataclasses.dataclass
class LDAPSourceReport(StaleEntityRemovalSourceReport):
Expand Down Expand Up @@ -218,7 +248,10 @@ def __init__(self, ctx: PipelineContext, config: LDAPSourceConfig):
except ldap.LDAPError as e:
raise ConfigurationError("LDAP connection failed") from e

self.lc = create_controls(self.config.page_size)
if self.config.pagination_enabled:
self.lc = create_controls(self.config.page_size)
else:
self.lc = None

@classmethod
def create(cls, config_dict: Dict[str, Any], ctx: PipelineContext) -> "LDAPSource":
Expand All @@ -244,7 +277,7 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
ldap.SCOPE_SUBTREE,
self.config.filter,
self.config.attrs_list,
serverctrls=[self.lc],
serverctrls=[self.lc] if self.lc else [],
)
_rtype, rdata, _rmsgid, serverctrls = self.ldap_client.result3(msgid)
except ldap.LDAPError as e:
Expand Down Expand Up @@ -278,15 +311,17 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
else:
self.report.report_dropped(dn)

pctrls = get_pctrls(serverctrls)
if not pctrls:
self.report.report_failure(
"ldap-control", "Server ignores RFC 2696 control."
)
if serverctrls:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the source report failure if serverctrls is not truthy ? When is serverctls not truthy ?

Copy link
Contributor Author

@alplatonov alplatonov Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the source report failure if serverctrls is not truth ?

No, only if enabled pagecontrol. i fix it.

When is serverctls not truth?

In some scenarios, clients might want to check if the server supports specific controls without actually using them. To do this, they can send an empty server control in the request and observe the server's response. If the server supports the control, it will be included in the response.

Changed on self.lc

pctrls = get_pctrls(serverctrls)
if not pctrls:
self.report.report_failure(
"ldap-control", "Server ignores RFC 2696 control."
)
break
cookie = set_cookie(self.lc, pctrls)
else:
break

cookie = set_cookie(self.lc, pctrls)

def handle_user(self, dn: str, attrs: Dict[str, Any]) -> Iterable[MetadataWorkUnit]:
"""
Handle a DN and attributes by adding manager info and constructing a
Expand All @@ -300,15 +335,11 @@ def handle_user(self, dn: str, attrs: Dict[str, Any]) -> Iterable[MetadataWorkUn
manager_filter = self.config.filter
else:
manager_filter = None
if self.config.manager_pagination_enabled:
ctrls = [self.lc]
else:
ctrls = None
manager_msgid = self.ldap_client.search_ext(
m_cn,
ldap.SCOPE_BASE,
manager_filter,
serverctrls=ctrls,
serverctrls=[self.lc],
alplatonov marked this conversation as resolved.
Show resolved Hide resolved
)
result = self.ldap_client.result3(manager_msgid)
if result[1]:
Expand Down Expand Up @@ -351,36 +382,39 @@ def build_corp_user_mce(
last_name = attrs[self.config.user_attrs_map["lastName"]][0].decode()
groups = parse_groups(attrs, self.config.user_attrs_map["memberOf"])

email = (
(attrs[self.config.user_attrs_map["email"]][0]).decode()
if self.config.user_attrs_map["email"] in attrs
else ldap_user
)
display_name = (
(attrs[self.config.user_attrs_map["displayName"]][0]).decode()
if self.config.user_attrs_map["displayName"] in attrs
else full_name
)
department_id = (
int(attrs[self.config.user_attrs_map["departmentId"]][0].decode())
if self.config.user_attrs_map["departmentId"] in attrs
else None
)
department_name = (
(attrs[self.config.user_attrs_map["departmentName"]][0]).decode()
if self.config.user_attrs_map["departmentName"] in attrs
else None
)
country_code = (
(attrs[self.config.user_attrs_map["countryCode"]][0]).decode()
if self.config.user_attrs_map["countryCode"] in attrs
else None
)
title = (
attrs[self.config.user_attrs_map["title"]][0].decode()
if self.config.user_attrs_map["title"] in attrs
else None
)
if attrs.get(self.config.user_attrs_map["email"]):
email = (attrs[self.config.user_attrs_map["email"]][0]).decode()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this logic be extracted into a method and use the method - say - get_attr_or_none everywhere. looks like a lot of repetition here..

def get_attr_or_none(attrs, key, default=None):
    return attrs[self.config.user_attrs_map[key]][0].decode() if attrs.get(self.config.user_attrs_map[key]) else default

also is it assumed that attrs[self.config.user_attrs_map[key]] is instance of list ? Is is always true ? May help to add a check here to avoid errors.

else:
email = ldap_user
if attrs.get(self.config.user_attrs_map["displayName"]):
display_name = (
attrs[self.config.user_attrs_map["displayName"]][0]
).decode()
else:
display_name = full_name
if attrs.get(self.config.user_attrs_map["departmentId"]):
department_id = (
attrs[self.config.user_attrs_map["departmentId"]][0]
).decode()
else:
department_id = None
if attrs.get(self.config.user_attrs_map["departmentName"]):
department_name = (
attrs[self.config.user_attrs_map["departmentName"]][0]
).decode()
else:
department_name = None
if attrs.get(self.config.user_attrs_map["countryCode"]):
country_code = (
attrs[self.config.user_attrs_map["countryCode"]][0]
).decode()
else:
country_code = None
if attrs.get(self.config.user_attrs_map["title"]):
title = (attrs[self.config.user_attrs_map["title"]][0]).decode()
else:
title = None

custom_props_map = {}
if self.config.custom_props_list:
for prop in self.config.custom_props_list:
Expand Down Expand Up @@ -420,21 +454,24 @@ def build_corp_group_mce(self, attrs: dict) -> Optional[MetadataChangeEvent]:
full_name = cn[0].decode()
admins = parse_users(attrs, self.config.group_attrs_map["admins"])
members = parse_users(attrs, self.config.group_attrs_map["members"])
email = (
attrs[self.config.group_attrs_map["email"]][0].decode()
if self.config.group_attrs_map["email"] in attrs
else full_name
)
description = (
attrs[self.config.group_attrs_map["description"]][0].decode()
if self.config.group_attrs_map["description"] in attrs
else None
)
displayName = (
attrs[self.config.group_attrs_map["displayName"]][0].decode()
if self.config.group_attrs_map["displayName"] in attrs
else None
)

if attrs.get(self.config.group_attrs_map["email"]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

email = attrs[self.config.group_attrs_map["email"]][0].decode()
else:
email = full_name
if attrs.get(self.config.group_attrs_map["description"]):
description = attrs[self.config.group_attrs_map["description"]][
0
].decode()
else:
description = None
if attrs.get(self.config.group_attrs_map["displayName"]):
displayName = attrs[self.config.group_attrs_map["displayName"]][
0
].decode()
else:
displayName = None

group_snapshot = CorpGroupSnapshotClass(
urn=f"urn:li:corpGroup:{full_name}",
aspects=[
Expand Down
35 changes: 0 additions & 35 deletions metadata-ingestion/tests/integration/ldap/ldap_mces_golden.json
Original file line number Diff line number Diff line change
Expand Up @@ -55,41 +55,6 @@
"runId": "ldap-test"
}
},
{
alplatonov marked this conversation as resolved.
Show resolved Hide resolved
"proposedSnapshot": {
"com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": {
"urn": "urn:li:corpuser:hsimpson",
"aspects": [
{
"com.linkedin.pegasus2avro.identity.CorpUserInfo": {
"customProperties": {
"givenName": "Homer"
},
"active": true,
"displayName": "Homer Simpson",
"email": "hsimpson",
"title": "Mr. Everything",
"managerUrn": "urn:li:corpuser:bsimpson",
"departmentId": 1001,
"departmentName": "1001",
"firstName": "Homer",
"lastName": "Simpson",
"fullName": "Homer Simpson"
}
},
{
"com.linkedin.pegasus2avro.identity.GroupMembership": {
"groups": []
}
}
]
}
},
"systemMetadata": {
"lastObserved": 1615443388097,
"runId": "ldap-test"
}
},
{
"proposedSnapshot": {
"com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -69,8 +69,9 @@
"changeType": "UPSERT",
"aspectName": "status",
"aspect": {
"value": "{\"removed\": false}",
"contentType": "application/json"
"json": {
"removed": false
}
},
"systemMetadata": {
"lastObserved": 1615443388097,
Expand All @@ -83,8 +84,9 @@
"changeType": "UPSERT",
"aspectName": "status",
"aspect": {
"value": "{\"removed\": false}",
"contentType": "application/json"
"json": {
"removed": false
}
},
"systemMetadata": {
"lastObserved": 1615443388097,
Expand Down
Loading