
SimpleNeo4jRepository.findAll(Pageable) fails on big databases #2597

Closed
bonelli opened this issue Sep 19, 2022 · 9 comments
@bonelli

bonelli commented Sep 19, 2022

Invoking SimpleNeo4jRepository.findAll(Pageable) will cause an OutOfMemoryError (OOME) on databases with many nodes, even with a small page size.

After some debugging, I found the probable cause in these lines of the Neo4jTemplate.createNodesAndRelationshipsByIdStatementProvider() method:

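// Builds the statement that fetches the ids of ALL root nodes matching the
// entity's label; the Pageable's skip/limit is not applied at this point.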
Statement rootNodesStatement = cypherGenerator
.prepareMatchOf(entityMetaData, queryFragments.getMatchOn(), queryFragments.getCondition())
.returning(Constants.NAME_OF_SYNTHESIZED_ROOT_NODE).build();

The rootNodesStatement that is used to detect circular dependencies queries the database for all nodes of the given type, extracting their IDs.

This is no big deal on small databases, but in my case I have a single database with 10M+ nodes of the given type (e.g. Foo nodes), and the JVM starts downloading many millions of IDs from the database, eventually crashing with an OOME or failing the execution with a query timeout.

Is it essential to request the complete set of IDs for all of the Foo entities in the database, even when I'm just trying to fetch one page of ~20 elements?

@meistermeier
Collaborator

Thanks for reporting this. There is definitely a missed skip/limit in the initial query.
I am pretty sure that the 6.3.4-GH-2597-SNAPSHOT will help you with this. It would be very helpful for us if you could try it with the request above on your domain.

If you haven't done so already, here are the repository definitions for Spring snapshots and milestones to include in your build:

<repositories>
  <repository>
    <id>spring-milestones</id>
    <name>Spring Milestones</name>
    <url>https://repo.spring.io/milestone</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>spring-snapshots</id>
    <name>Spring Snapshots</name>
    <url>https://repo.spring.io/snapshot</url>
    <releases>
      <enabled>false</enabled>
    </releases>
  </repository>
</repositories>

@meistermeier meistermeier added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged labels Sep 29, 2022
@meistermeier meistermeier added this to the 6.3.4 (2021.2.4) milestone Sep 29, 2022
@bonelli
Author

bonelli commented Oct 4, 2022

I tested version 6.3.4-GH-2597-SNAPSHOT, but it still OOMs when asking for a page of 20 elements of a heavily interconnected graph with ~10M nodes.

I don't really know if the issue arises from this same line of code; I can't debug it right now.
Incidentally: how can I write an integration test with such a big graph?
At the moment I'm testing manually, but I'd like to automate it.
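
A minimal sketch of one way to seed such a graph for a test, assuming Testcontainers with the org.testcontainers:neo4j module and a Neo4j 4.4+ image (the Foo label and the batch size are illustrative, not from this thread):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.testcontainers.containers.Neo4jContainer;

class BigGraphFixture {

    static void seed() {
        try (Neo4jContainer<?> neo4j = new Neo4jContainer<>("neo4j:4.4")) {
            neo4j.start();
            try (Driver driver = GraphDatabase.driver(neo4j.getBoltUrl(),
                    AuthTokens.basic("neo4j", neo4j.getAdminPassword()));
                 Session session = driver.session()) {
                // Batched CREATE so neither the driver nor the server holds
                // a single transaction with millions of pending writes.
                session.run("UNWIND range(1, 10000000) AS i "
                        + "CALL { WITH i CREATE (:Foo {id: i}) } "
                        + "IN TRANSACTIONS OF 10000 ROWS");
            }
        }
    }
}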

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Oct 4, 2022
@meistermeier
Collaborator

But the underlying problem, if I understand you correctly, is that there are a lot of relationships on the next level.
Using the Movie graph as an example here, a Movie has various Actors (let's say in your case > 1M ;) ).
If you query for the Movies with a Pageable, you will only reduce the number of initial Movies getting fetched (with this patch), but you do not reduce the number of connected Actors. It would be counter-intuitive to also apply the limit to the relationships, because the Java objects would end up only partially hydrated.
I might have gotten your comment wrong, but if I am right, you should think about using a custom Cypher statement and using the Pageable information within your query, as described in the documentation: https://docs.spring.io/spring-data/neo4j/docs/current/reference/html/#faq.custom-queries-with-page-and-slice
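
A minimal sketch of that approach, following the pattern from the linked FAQ (the Movie entity and the repository name are illustrative; Movie is assumed to be a mapped @Node entity):

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.query.Query;

public interface MovieRepository extends Neo4jRepository<Movie, String> {

    // SDN binds the Pageable's offset and page size to $skip and $limit;
    // the countQuery is required so the Page can report the total count.
    @Query(value = "MATCH (m:Movie) RETURN m ORDER BY m.title SKIP $skip LIMIT $limit",
           countQuery = "MATCH (m:Movie) RETURN count(m)")
    Page<Movie> findAllMovies(Pageable pageable);
}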

@meistermeier meistermeier added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Oct 6, 2022
@bonelli
Author

bonelli commented Oct 7, 2022 via email

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Oct 7, 2022
@meistermeier
Collaborator

> So I'm in a case in which the two Spring libraries don't integrate nicely until I add some custom queries - it looks a bit redundant

Before I merge the changes for the initial query and close this, I would like to discuss what you would expect.
AFAIK SD REST will also only request the limit for the aggregate root and not for inner collections.

@bonelli
Author

bonelli commented Oct 7, 2022

This is my use case:

  1. I filled a Neo4j database with my graph of entities (e.g. all Humans, with relationships to their parents/children/spouses).
  2. I perform an HTTP GET request to the main Spring Data REST collection for a certain SD repository (e.g. http://localhost:8080/wikidataHumans).
  3. If the SD repository supports paging, the default findAll(Pageable) method is called for the first page with the default pageSize=20.
  4. Within a reasonable time (milliseconds) I get back a body with the first page of results.

At the moment the default behavior of SD REST is to call the default findAll(Pageable) implementation, which fails with an OOME.

I have read about the reasons behind SDN 6 hydrating the whole model in memory, but it can be very limiting in many cases with heavily interconnected graph databases.

@meistermeier
Collaborator

But you would always run into this issue with other SD modules as well. The fix above now limits the amount of root data; what happens on the next level of related entities will never be modified by the Pageable provided.
If I understood you correctly, you want this to happen somehow as well. Spring Data REST is even more domain-driven-design oriented than SDN and wants to return the whole aggregate for each root entity.
And you are perfectly right with the observation that a graph and full hydration driven by DDD concerns are not a match made in heaven; it requires sensible mapping to avoid cloning the whole graph into memory when it is not needed.
Something I haven't tried out lately is the use of projections or separate repositories as described here: https://docs.spring.io/spring-data/rest/docs/current/reference/html/#projections-excerpts
I will do this next week and see how well it interacts with SDN.
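
A minimal sketch of such an excerpt projection, assuming a hypothetical Human entity; it restricts what SD REST serializes to a few scalar properties, while whether it also restricts what SDN fetches from the database is exactly what remains to be verified:

import org.springframework.data.rest.core.config.Projection;

// Picked up automatically when placed in (or below) the package of Human.
@Projection(name = "summary", types = { Human.class })
public interface HumanSummary {

    String getName();
}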

@bonelli
Author

bonelli commented Dec 30, 2022

About this issue: I finally worked around it by overriding the queries for findAll, findBy, and all the other find* methods in the repositories.

I need to understand more about how I can work with a huge interconnected database with SDN, because the way I'm using it is very limiting at the moment.
If I try to use Spring Data REST to save any entity node (with a PUT request), it detaches the node from all neighbor nodes exactly at the limit of the depth I set in the custom query for the find* methods.
I understand that this might be expected, but I need a way to work with big datasets; if the dataset were small enough to fit in memory I wouldn't need a database in the first place, and my model happens to be fairly connected.

Is there any general way to write and read a single entity without having the whole database in memory?
Is studying a proper orientation for the relationships the only thing I can do to avoid the issue?
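
A minimal sketch of the kind of override described above, assuming a hypothetical Human entity with a SPOUSE relationship; returning the root node plus its collected relationships and direct neighbors caps hydration at depth 1:

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.query.Query;

public interface HumanRepository extends Neo4jRepository<Human, Long> {

    // Redeclaring findAll lets the custom query replace the derived one.
    // Only direct SPOUSE neighbors are returned, so deeper levels of the
    // graph never get pulled into memory.
    @Override
    @Query(value = "MATCH (h:Human) WITH h ORDER BY h.name SKIP $skip LIMIT $limit "
                 + "OPTIONAL MATCH (h)-[r:SPOUSE]-(s:Human) "
                 + "RETURN h, collect(r), collect(s)",
           countQuery = "MATCH (h:Human) RETURN count(h)")
    Page<Human> findAll(Pageable pageable);
}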

@meistermeier
Collaborator

I think that issue #2762 is exactly what is left over from the discussion here. The limit feature got merged in the context of another issue.
Closing this.
