Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving many entities with relationships using projections causes poor performance and increases Forseti deadlocks errors when compared to OGM #2666

Open
justodiaz opened this issue Feb 7, 2023 · 6 comments
Assignees
Labels
status: waiting-for-triage An issue we've not yet triaged

Comments

@justodiaz
Copy link

justodiaz commented Feb 7, 2023

We are using:
spring-data-neo4j 6.3.5
neo4j 4.3.10

As described in the reference doc, saving relationships for multiple entities tends to generate and run several queries.

It goes something like:

  1. Save root entities (this appears to be done in a single "batch" UNWIND query)
  2. For a given relationship type, If entity is not new, DELETE existing relationships.
  3. Create or Update the related entity (or entities if a -to-many relationship) using the CREATE and UNION MATCH statement
  4. Run the MERGE query (uses UNWIND for -to-many relationships).
  5. If "save depth" is not limited , traverse the related reference and repeat starting at step 2 for the related entity
  6. Handle the next relationship type starting at step 2.

We can see all these queries being logged in Neo4j query.log.

Assume we have a model

@Node(primaryLabel = "Root")
public class Root {
  @Id
  @Property(name = "id")
  @GeneratedValue(value = OurCustomIdStrategy.class)
  String id;
    
  @Relationship(type = "FIRST_REL_TYPE")
  Child child;

 @Relationship(type = "SECOND_REL_TYPE")
  List<OtherChild> otherChildren;
}

and we wanted to save multiple Root entities including their relationships to Child or OtherChild. We are just interested in creating the relationship to existing entities, so assume Child and OtherChild entities are already persisted in the DB.

We currently run something like:

List<RootEntities> rootEntities = getManyRootEntities();

neo4jTemplate.saveAllAs(rootEntities, RootEntityProjection.class)

where the projection is defined as

public interface RootEntityProjection {
  String getId();
  String getPropA();

  Child getChild();

  List<OtherChild> getOtherChildren();
 
  interface Child {
    String getId();
  }

  interface OtherChild {
    String getId();
  }
}

This works as expected, it will create all the root entities and correctly creates the relationship with MERGE (startNode)-[:FIRST_REL_TYPE]->(endNode) etc.

However, it doesn't seem to scale in terms of performance when the number of root entities and relationships to save is large. We noticed from the Neo4j logs that SDN appears to inefficiently run an update property query for each related entity for each root entity. It will also run a separate MERGE query for each root entity for each relationship type. This is in contrast to OGM which appeared to do a "batch" UNWIND MERGE query per relationship type. SDN appears to only do it for the -to-many relationship type and for the root entities' properties.

We have tried removing the ids from the projections like so:

  ...
  Child getChild();

  List<OtherChild> getOtherChildren();

  interface Child { }

  interface OtherChild {  }
}

but at a minimum the queries will always update the id, because we are using generated id values for all of our entities. I suppose we could not use generated id values, but that is not a solution we can consider.

We have also noticed duplicate queries being ran if for example there were multiple root entities that had a reference to the same related entity:

  Root root1 = new Root();
  Root root2 = new Root();

  Child child = repository.findById("child123");
  root1.setChild(child);
  root2.setChild(child);

  var roots = List.of(root1, root2);
  neo4jTemplate.saveAll(roots);

in this example, a query to update the properties of Child will run twice.

Also if the root entity is not new, all existing relationships are first deleted, which seems inefficient if some of them are expected to just be re-created in the next MERGE query.

Is there a way we can avoid generating update queries for the related entity properties if we just want to create the relationship between existing entities?

Furthermore, we have noticed an increased in Neo4j Forseti deadlock errors when compared to OGM. From our investigation and from Neo4j official documentation, we can reduce the amount of deadlocks errors by ordering our SET and MERGE queries and whenever possible we should use a CREATE query instead of MERGE to create relationships. Currently, SDN doesn't seem to follow a consistent order, even when ordering the relationships inside the projection interfaces.

Is there a way we can specify SDN to use CREATE instead of MERGE dynamically at run time? Also is there a way to avoid updating related entity properties and order the MERGE queries so that the same order is preserved in between restarts of the app?

I did notice there were existing issues that seem to report on similar findings:
#2235
#2636
#2593
#2588

Will these performance and deadlocking issues be resolved in upcoming versions?

Is there an alternative, more performant, way that avoids running so many queries to save relationships for root entities?

Currently, we have decided to define our own custom UNWIND @Query() repository queries instead of using the template Neo4jOperations, saveAll(), saveAllAs() etc. methods.

something like

  @Query("UNWIND keys($map) AS valueMapId "
      + "MATCH (valueMap:ValueMap {id: valueMapId})  "
      + "OPTIONAL MATCH (valueMap)-[created_by:CREATED_BY]->(:User) "
      + "DELETE created_by "
      + "WITH DISTINCT valueMap, valueMapId "
      + "MATCH (user:User {id: $map[valueMapId]}) "
      + "MERGE (valueMap)-[:CREATED_BY]->(user);")
  void replaceCreatedByRelationship(Map<String, String> map);

which allows us to batch create relationships while avoiding the need to update related entity properties. We also have control of whether to use the CREATE clause instead of MERGE to avoid deadlocks.

We have seen significant improvements both in performance and reduction of Forseti deadlock errors when saving a large amount of root entities.

We are still wondering, however, if there are any other recommendations we can implement using SDN that can help in a similar way and will allow us to delegate relationship creation back to SDN.

Thank you!

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Feb 7, 2023
@meistermeier meistermeier self-assigned this Feb 8, 2023
@meistermeier
Copy link
Collaborator

Thanks for the comprehensive report.
I spend some time to go through the multiple points, you have raised, and try to explain why certain decisions have been taken, how to mitigate some of them and what we are working on. (Maybe I repeat some parts from other issues/the documentation unintentionally here but I want to draw the whole picture).

The difference between SDN and Neo4j-OGM when in comes to updating/saving you are observing is based on the fact that Spring Data Neo4j does not have a cache or dirty tracking mechanism in place. This is a direction we have chosen based on our experience with Neo4j-OGM to avoid an additional virtual graph on the application side if there's already the Java object-graph that should get written to the database.
As a result, SDN does not know anything on save about the state of the target database. It sure has some ideas, like entities with generated ids != null (or provided ids + version property set) represent existing instances, but no information of the changes within an entity. That's why every reachable entity in the object graph gets updated even though nothing has changed.
Also simple relationships have this remove/re-create process because we don't know if a relationship got deleted or was created before. I mentioned simple relationship explicitly here to point out the difference to the classes annotated with @RelationshipProperties that represent a relationship in the Java model. The advantage of using this class between your Root and Child classes is the knowledge about the internal id of the relationship.
SDN makes use of this information during the save process. First, relationships with ids won't get deleted. The delete statement gets still issued but with a exclusion list containing those ids. Secondly, the relationships will get updated in batches. Last but not least unknown (new) relationships will get created with a CREATE statement instead of the MERGE call.

I think there is some potential in having knowledge about the existing relationships vs. new relationships to improve the save process even more. E.g. I am currently trying to avoid the merge at all for existing relationships.
Also, as you have pointed out correctly, from a theoretical view, the child nodes should get updated also in batches, if possible.

@justodiaz
Copy link
Author

Thanks for the detailed response! We almost never use @RelationshipProperties because most of our defined relationships don't have additional properties so we can experiment with that and check for improvements.

We also are curious about the ordering of relationship type creation. We have noticed both in OGM and SDN that the order in which different relationships get created is not consistent.
Using the same example from above

  ...
  @Relationship(type = "FIRST_REL_TYPE")
  Child child;

 @Relationship(type = "SECOND_REL_TYPE")
  List<OtherChild> otherChildren;
}

Saving this entity means it will need to save a relationship to Child and OtherChild. Sometimes it will first merge/create to the Child followed by the OtherChild, and other times reversed where the OtherChild is first then the Child.

Is ordering something that can be configured with SDN? Thanks!

@meistermeier
Copy link
Collaborator

Although technical it would be possible to have a fixed order of property parsing (plain property + relationship definitions), this order would be fixed for every user. This would be done by a Comparator that we hand over to the Spring Data core module. For the simple example of sorting by property name, the consequence would be to name the properties in the Java domain model correctly for the expected order.
I can see a tiny benefit of having this fixed order: Reproducible errors. For everything else it is much likely that the answer to a problem is potentially somewhere else to find. Re-ordering by renaming (referring to the example of sorting by name) shouldn't be the solution.

@justodiaz
Copy link
Author

Thanks! Sounds like if we wanted to order when the relationship types are created across all entities, in the meantime we would have to use a separate implementation. Until SDN supports some sort of ordering.

@justodiaz
Copy link
Author

justodiaz commented Feb 13, 2023

Also, where would we pass in the Comparator? Is there a reference doc we can look at?
When you say

For everything else it is much likely that the answer to a problem is potentially somewhere else to find

did you mean ordering shouldn't be a solution? We know from Neo4j that ordering the MERGE queries can help reduce deadlocks. Or are you referring to other problems/errors? Thanks!

@meistermeier
Copy link
Collaborator

There is currently no way to provide a custom Comparator for ordering the fields. This would have to get implemented first. There is just an API in Spring Data Core that gets called from SDN without any customisation regarding ordering right now.

Also, I just realised that I got you wrong on the ordering in general. I was talking about the properties on the Root and not the properties of an item in the defined lists. There isn't any support in SDN (or Spring Data core) for this.
This would mean that every generated query with relationships would need additional order clauses for each relationship for reading. On the writing side, the ordering of List items should stay the same for non-dynamic relationships. So, if you know how the order should look like, you can the List on the Root entity upfront the save call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: waiting-for-triage An issue we've not yet triaged
Projects
None yet
Development

No branches or pull requests

3 participants