Reintroduce relying on deduplication relations when making content identifiers aligned with graph identifiers #1523

Open
marekhorst opened this issue Mar 6, 2025 · 0 comments

marekhorst commented Mar 6, 2025

This essentially reverts #1393: we need to rely on the DeduplicationMappingConverter again and bring back the logic of translating content identifiers based on dedup relations imported from the graph.

Three years ago this was replaced with #1264, which relies on the mapping between the original identifiers and persistent identifiers. Since we decided back then to run the IIS on a non-deduped version of the graph, we could simply drop the id translation based on the dedup identifiers.

Now we also want to be able to run the IIS on a deduped version of the graph, which means we need to rely on two layers of id translation for contents:

  1. the already implemented translation between the original identifiers (as defined in the PDF Aggregation System) and persistent identifiers (defined in the graph)
  2. the "resurrected" translation between the persistent identifiers and deduplicated identifiers, applied whenever a given persistent id was deduped

This way we should end up with content identifiers which are fully matchable with the deduped graph. When running the IIS on the non-deduplicated graph, the second id translation will not be applied because there will be no merges relations in that graph.
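
The two layers described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual IIS code: the function names, the plain dictionaries standing in for the two mappings, and the `(dedup_id, merged_id)` shape of the merges relations are all hypothetical, as are the identifiers in the usage example.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_dedup_mapping(
    merges_relations: Iterable[Tuple[str, str]],
) -> Dict[str, str]:
    """Build a persistent id -> deduplicated id lookup.

    Assumption: each merges relation is a (dedup_id, merged_id) pair,
    where merged_id is the persistent id that was merged into dedup_id.
    """
    return {merged_id: dedup_id for dedup_id, merged_id in merges_relations}


def translate_content_id(
    original_id: str,
    original_to_persistent: Dict[str, str],
    persistent_to_dedup: Dict[str, str],
) -> Optional[str]:
    """Translate a content identifier into the id used in the graph."""
    # Layer 1: original id (PDF Aggregation System) -> persistent id (graph).
    persistent_id = original_to_persistent.get(original_id)
    if persistent_id is None:
        # Assumption: content without a known persistent id is not matchable.
        return None
    # Layer 2: persistent id -> deduplicated id, only if the id was deduped.
    # With no merges relations the lookup is empty and the id passes through.
    return persistent_to_dedup.get(persistent_id, persistent_id)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    original_to_persistent = {"orig-1": "pid-A", "orig-2": "pid-B"}
    dedup_lookup = build_dedup_mapping([("dedup-X", "pid-A")])
    for content_id in ("orig-1", "orig-2"):
        print(content_id, "->", translate_content_id(
            content_id, original_to_persistent, dedup_lookup))
```

Falling back to the persistent id when no dedup entry exists is what lets the same code path serve both graph variants: on a non-deduped graph the second lookup is simply empty.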

marekhorst self-assigned this Mar 6, 2025
marekhorst added a commit that referenced this issue Mar 6, 2025
…ing content identifiers aligned with graph identifiers

InfoSpace importer now relies on two layers of mappings in order to fully support import from both deduplicated and non-deduplicated graph:
* between the original identifiers (as defined in the PDF Aggregation System) and persistent identifiers (defined in the graph)
* between the persistent identifiers and deduplicated identifiers whenever a given entity with a persistent id was deduplicated

The first mapping was in use up until now, when IIS was mostly run on non-deduplicated data. The second mapping was reintroduced after having been replaced by the first mapping as part of git#1264.