
SimpleNeo4jRepository.findAll(Pageable) fails on big databases #2597

Closed
bonelli opened this issue Sep 19, 2022 · 9 comments
@bonelli

bonelli commented Sep 19, 2022

Invoking SimpleNeo4jRepository.findAll(Pageable) will cause an OutOfMemoryError (OOME) on databases with many nodes, even with a small page size.

After some debugging, I found the probable cause in these lines of the Neo4jTemplate.createNodesAndRelationshipsByIdStatementProvider() method:

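// Builds the statement that fetches the ids of ALL root nodes matching the
// entity's label; the Pageable's skip/limit is not applied at this point.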
Statement rootNodesStatement = cypherGenerator
.prepareMatchOf(entityMetaData, queryFragments.getMatchOn(), queryFragments.getCondition())
.returning(Constants.NAME_OF_SYNTHESIZED_ROOT_NODE).build();

The rootNodesStatement that is used to detect circular dependencies queries the database for all nodes of the given type, extracting their IDs.

This is no big deal on small databases, but in my case I have a single database with 10M+ nodes of the given type (e.g. Foo nodes), and the JVM starts downloading many millions of IDs from the database, eventually crashing with an OOME or failing the execution with a query timeout.

Is it essential to request the complete set of IDs for all of the Foo entities in the database, even when I'm just trying to fetch one page of ~20 elements?

@meistermeier
Collaborator

Thanks for reporting this. There is definitely a missed skip/limit in the initial query.
I am pretty sure that the 6.3.4-GH-2597-SNAPSHOT will help you with this. It would be very helpful for us if you could try it with the request above on your domain.

If you haven't done so already, here are the repository definitions for Spring snapshots and milestones to include in your build:

<repositories>
  <repository>
    <id>spring-milestones</id>
    <name>Spring Milestones</name>
    <url>https://repo.spring.io/milestone</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>spring-snapshots</id>
    <name>Spring Snapshots</name>
    <url>https://repo.spring.io/snapshot</url>
    <releases>
      <enabled>false</enabled>
    </releases>
  </repository>
</repositories>

@meistermeier meistermeier added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged labels Sep 29, 2022
@meistermeier meistermeier added this to the 6.3.4 (2021.2.4) milestone Sep 29, 2022
@bonelli
Author

bonelli commented Oct 4, 2022

I tested version 6.3.4-GH-2597-SNAPSHOT, but it still OOMs when asking for a page of 20 elements of a heavily interconnected graph with ~10M nodes.

I don't really know if the issue arises from this same line of code; I can't debug it right now.
Incidentally: how can I write an integration test with such a big graph?
At the moment I'm testing manually, but I'd like to automate it.
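
A minimal sketch of one way to seed such a graph for a test, assuming Testcontainers with the org.testcontainers:neo4j module and a Neo4j 4.4+ image (the Foo label and the batch size are illustrative, not from this thread):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.testcontainers.containers.Neo4jContainer;

class BigGraphFixture {

    static void seed() {
        try (Neo4jContainer<?> neo4j = new Neo4jContainer<>("neo4j:4.4")) {
            neo4j.start();
            try (Driver driver = GraphDatabase.driver(neo4j.getBoltUrl(),
                    AuthTokens.basic("neo4j", neo4j.getAdminPassword()));
                 Session session = driver.session()) {
                // Batched CREATE so neither the driver nor the server holds
                // a single transaction with millions of pending writes.
                session.run("UNWIND range(1, 10000000) AS i "
                        + "CALL { WITH i CREATE (:Foo {id: i}) } "
                        + "IN TRANSACTIONS OF 10000 ROWS");
            }
        }
    }
}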

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Oct 4, 2022
@meistermeier
Collaborator

But the underlying problem, if I understand you correctly, is that there are a lot of relationships on the next level.
Using the Movie graph as an example here, a Movie has various Actors (let's say in your case > 1M ;) ).
If you query for the Movies with a Pageable, you will only reduce the number of initial Movies getting fetched (with this patch), but you do not reduce the number of connected Actors. It would be counter-intuitive to also apply the limit to the relationships, because the Java objects would end up only partially hydrated.
I might have gotten your comment wrong, but if I am right, you should think about using a custom Cypher statement and using the Pageable information within your query, as described in the documentation: https://docs.spring.io/spring-data/neo4j/docs/current/reference/html/#faq.custom-queries-with-page-and-slice
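
A minimal sketch of that approach, following the pattern from the linked FAQ (the Movie entity and the repository name are illustrative; Movie is assumed to be a mapped @Node entity):

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.query.Query;

public interface MovieRepository extends Neo4jRepository<Movie, String> {

    // SDN binds the Pageable's offset and page size to $skip and $limit;
    // the countQuery is required so the Page can report the total count.
    @Query(value = "MATCH (m:Movie) RETURN m ORDER BY m.title SKIP $skip LIMIT $limit",
           countQuery = "MATCH (m:Movie) RETURN count(m)")
    Page<Movie> findAllMovies(Pageable pageable);
}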

@meistermeier meistermeier added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Oct 6, 2022
@bonelli
Author

bonelli commented Oct 7, 2022 via email

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Oct 7, 2022
@meistermeier
Collaborator

> So I'm in a case in which the two Spring libraries don't integrate nicely until I add some custom queries - it looks a bit redundant

Before I merge the changes for the initial query and close this, I would like to discuss what you would expect.
AFAIK SD REST will also only request the limit for the aggregate root and not for inner collections.

@bonelli
Author

bonelli commented Oct 7, 2022

This is my use case:

  1. I filled a Neo4j database with my graph of entities (e.g. all Humans, with relationships to their parents/children/spouses).
  2. I perform an HTTP GET request to the main Spring Data REST collection for a certain SD repository (e.g. http://localhost:8080/wikidataHumans).
  3. If the SD repository supports paging, the default findAll(Pageable) method is called for the first page with the default pageSize=20.
  4. Within a reasonable time (milliseconds) I get back a body with the first page of results.

At the moment the default behavior of SD REST is to call the default findAll(Pageable) implementation, which fails with an OOME.

I have read about the reasons behind SDN 6 hydrating the whole model in memory, but it can be very limiting in many cases with heavily interconnected graph databases.

@meistermeier
Collaborator

But you would always run into this issue with other SD modules as well. The fix above now limits the amount of root data; what happens on the next level of related entities will never be modified by the Pageable provided.
If I understood you correctly, you want this to happen somehow as well. Spring Data REST is even more domain-driven-design oriented than SDN and wants to return the whole aggregate for each root entity.
And you are perfectly right with the observation that a graph and full hydration driven by DDD concerns are not a match made in heaven; it requires sensible mapping to avoid cloning the whole graph into memory when it is not needed.
Something I haven't tried out lately is the use of projections or separate repositories as described here: https://docs.spring.io/spring-data/rest/docs/current/reference/html/#projections-excerpts
I will do this next week and see how well it interacts with SDN.
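
A minimal sketch of such an excerpt projection, assuming a hypothetical Human entity; it restricts what SD REST serializes to a few scalar properties, while whether it also restricts what SDN fetches from the database is exactly what remains to be verified:

import org.springframework.data.rest.core.config.Projection;

// Picked up automatically when placed in (or below) the package of Human.
@Projection(name = "summary", types = { Human.class })
public interface HumanSummary {

    String getName();
}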

@bonelli
Author

bonelli commented Dec 30, 2022

About this issue: I finally worked around it by overriding the queries for findAll, findBy, and all the other find* methods in the repositories.

I need to understand more about how I can work with a huge interconnected database with SDN, because the way I'm using it is very limiting at the moment.
If I try to use Spring Data REST to save any entity node (with a PUT request), it detaches the node from all neighbor nodes exactly at the limit of the depth I set in the custom query for the find* methods.
I understand that this might be expected, but I need a way to work with big datasets; if the dataset were small enough to fit in memory I wouldn't need a database in the first place, and my model happens to be fairly connected.

Is there any general way to write and read a single entity without having the whole database in memory?
Is studying a proper orientation for the relationships the only thing I can do to avoid the issue?
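
A minimal sketch of the kind of override described above, assuming a hypothetical Human entity with a SPOUSE relationship; returning the root node plus its collected relationships and direct neighbors caps hydration at depth 1:

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.query.Query;

public interface HumanRepository extends Neo4jRepository<Human, Long> {

    // Redeclaring findAll lets the custom query replace the derived one.
    // Only direct SPOUSE neighbors are returned, so deeper levels of the
    // graph never get pulled into memory.
    @Override
    @Query(value = "MATCH (h:Human) WITH h ORDER BY h.name SKIP $skip LIMIT $limit "
                 + "OPTIONAL MATCH (h)-[r:SPOUSE]-(s:Human) "
                 + "RETURN h, collect(r), collect(s)",
           countQuery = "MATCH (h:Human) RETURN count(h)")
    Page<Human> findAll(Pageable pageable);
}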

@meistermeier
Collaborator

I think that issue #2762 is exactly what is left over from the discussion here. The limit feature got merged in the context of another issue.
Closing this.
