Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the new traversal policy to simplify reharvest #92

Merged
merged 1 commit into from
Sep 27, 2024

Conversation

qtomlinson
Copy link
Collaborator

Previously, the "always" traversal policy was used to retrigger the harvest of a component. This policy essentially reran all the previously successfully executed tools. This can be quite cumbersome, especially for integration tests. Sometimes, certain tool results were not available, such as when the schema version had been updated, causing the previously harvested results to become stale and the tool results with the newly updated schema version to be missing. When the tool result for a specific component is missing, using the "always" policy leads to a "Unreachable for reprocessing " status and the tool being skipped.

To address this issue in integration testing, the approach was to first trigger a round of harvest using the default traversal policy. With the default policy, the tools would run for the component if there were no existing results with the correct schema version. If there were previous results, the tools would skip and use the existing results with the current schema version. This initial run was to ensure a complete set of tool results (both new and old) were available for the correct schema version. Upon completion of the first run, a re-harvest was triggered with "always" traversal policy so that all the tools were rerun and the tool results were updated.

With the introduction of the "reharvestAlways" traversal policy (clearlydefined/crawler#598), the integration tests can now be simplified. This PR handles the adaptation.

With the introduction of the "reharvestAlways" traversal policy, the integration tests can now be simplified.
@qtomlinson qtomlinson marked this pull request as ready for review September 25, 2024 18:48
@qtomlinson
Copy link
Collaborator Author

@RomanIakovlev as per our discussion in the dev meeting, here is PR with the explanation behind introducing the new traversal policy.

@qtomlinson qtomlinson marked this pull request as draft September 27, 2024 15:16
@qtomlinson qtomlinson self-assigned this Sep 27, 2024
@qtomlinson qtomlinson marked this pull request as ready for review September 27, 2024 15:31
Copy link
Collaborator

@RomanIakovlev RomanIakovlev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Do I understand correctly that this would make the harvest part of integration test ~2x faster by only doing a single harvesting per coordinate?

@qtomlinson
Copy link
Collaborator Author

qtomlinson commented Sep 27, 2024

Looks nice! Do I understand correctly that this would make the harvest part of integration test ~2x faster by only doing a single harvesting per coordinate?

Yes for new components (in the case of randomly selected components). There is also saving for partially harvested components. For fully harvested components, the improvement is small. This is due to the default traversal policy used in the initial round of harvest, which respects the existing harvest results and bypasses the majority of the downloading and processing work.

@qtomlinson
Copy link
Collaborator Author

The merging of the PR depends on clearlydefined/crawler#598

@qtomlinson qtomlinson merged commit bb29bc7 into clearlydefined:main Sep 27, 2024
4 checks passed
@qtomlinson qtomlinson deleted the qt/use_new_policy branch September 27, 2024 18:31
@qtomlinson
Copy link
Collaborator Author

qtomlinson commented Sep 27, 2024

The harvest integration test is successful and the log can be found here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants