Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-5231: fix poor query performance for hasStatements() in FedX #5232

Merged
merged 2 commits into from
Jan 24, 2025

Conversation

aschwarte10
Copy link
Contributor

@aschwarte10 aschwarte10 commented Jan 15, 2025

GitHub issue resolved: #5231

The previous implementation of the FedXConnection was delegating "hasStatements()" to the implementation of "getStatements()", where the latter was actually fetching data from the federation members.

For checks hasStatements() checks like {null, rdf:type, null} or even {null, null, null} the implementation is problematic as it would fetch all data matching the pattern from the federation members, only to answer if it actually exists.

We now make use of "existence" check on the federation members, and can actually rely on the source selection cache for this.

Unit test coverage has been added.


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

The previous implementation of the FedXConnection was delegating
"hasStatements()" to the implementation of "getStatements()", where the
latter was actually fetching data from the federation members.

For checks hasStatements() checks like {null, rdf:type, null} or even
{null, null, null} the implementation is problematic as it would fetch
all data matching the pattern from the federation members, only to
answer if it actually exists.

We now make use of "existence" check on the federation members, and can
actually rely on the source selection cache for this.

Unit test coverage has been added.
Comment on lines 639 to 647
/**
* Returns the accessible federation members in the context of the query. By default this is all federation members.
*
* @param queryInfo
* @return
*/
protected List<Endpoint> getAccessibleFederationMembers(QueryInfo queryInfo) {
return federationContext.getFederation().getMembers();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more why this needs to be protected. I didn't see that it's override anywhere in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. We make use of specializations of FederationEvalStrategy to validate new optimizations (before contributing them back to RDF4J) or add additional (special) ones applicable to our use-cases. Without having this method accessible in sub-classes, we would need to override the two methods entirely (get/hasStatementsInternal) as we have a specialization for accessible members, currently validating resilience, but looking ahead also for policies/permissions. Does this help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound good. Can you update the docs to reflect that this is meant to be an extension point for subclasses. If you are unsure if you need to change it in future you can annotate it as experimental if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, done

@hmottestad
Copy link
Contributor

Btw, as a rule of thumb we usually consider performance issues as bugs and allow them in bug fix releases.

Copy link
Contributor

@hmottestad hmottestad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge when you are ready.

@hmottestad
Copy link
Contributor

I cancelled the test that was taking too long. It's an issue related to some very specific edge cases when connections and resources are not closed before trying to shutdown the ShaclSail. I have a PR open for trying to fix it, but it's a bit tricky.

@aschwarte10 aschwarte10 merged commit dcacf74 into main Jan 24, 2025
8 of 9 checks passed
@aschwarte10 aschwarte10 deleted the GH-5231-poor-performance-has-statement branch January 24, 2025 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FedX: Poor potential query performance for RepositoryConnections#hasStatements()
2 participants