Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

Closed
andreiIfrim opened this issue May 17, 2016 · 8 comments

Comments

@andreiIfrim
Copy link

andreiIfrim commented May 17, 2016

Description
For one to many relationship, after migrating to datanucleus 4 the JDO queries are having a very low performance. Datanucelus brings all sub-objects in memory and for tables with millions of entries this takes the CPU at 100%.

Model example

@PersistenceCapable(detachable` = "true")
public class Activity {
    @Persistent(valueStrategy = IdGeneratorStrategy.UUIDHEX)
    @Column(jdbcType = "VARCHAR", length = 32)
    private String encodedKey;

    @Persistent(defaultFetchGroup = "true")
    @Element(dependent = "true")
    private List<FieldChangeItem> fieldChanges = null;
.....
}

When fetching the activities without any filter all field changes are taken into the memory.
Generated SQL that is causing problems:

SELECT 
    '....model.FieldChangeItem' AS `NUCLEUS_TYPE`,
    ....
FROM
    `FIELDCHANGEITEM` `A0`
WHERE
    `A0`.`FIELDCHANGES_INTEGER_IDX` >= 0
        AND EXISTS( SELECT 
            '.....model.Activity' AS `NUCLEUS_TYPE`,
                `A0_SUB`.`ENCODEDKEY` AS `DN_APPID`
        FROM
            `ACTIVITY` `A0_SUB`
        WHERE
            `A0`.`FIELDCHANGES_ENCODEDKEY_OWN` = `A0_SUB`.`ENCODEDKEY`)
ORDER BY `NUCORDER0`

Note
The problem is present for defined fetch groups as well, not only for defaultFetchGroup = "true".

Used Workaround
Removed defaultFetchGroup = "true" and made lazy programatic fetching for this list. In this case filtered by parent key queries were generated when the field change items were fetched.

@andyjefferson andyjefferson changed the title Low performance SQL are generated for Objects that are having lists of other objects Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects May 17, 2016
@andyjefferson
Copy link
Member

andyjefferson commented May 17, 2016

You mean BULK FETCH, that was introduced in DN 4.x, generates SQL to attempt to load objects now, whereas in earlier versions it simply ignored the users fetch plan and you had lazy loading always. If you think the SQL could/should be improved then you can kindly go to the code and contribute improvements. Here is the starting point
https://github.com/datanucleus/datanucleus-rdbms/blob/master/src/main/java/org/datanucleus/store/rdbms/query/BulkFetchExistsHelper.java

@andreiIfrim
Copy link
Author

Thank you for your fast response Andy. I will have a look on how this can be done better!

@andyjefferson
Copy link
Member

No comment on how this can be improved, nor a test that demonstrates the problem, so closing. Feel free to comment here and it could be reopened, or raise a new issue with some specific improvement and pull request that provides it.

@jonathanvx
Copy link

jonathanvx commented Mar 11, 2017

The way to fix this is choosing JOIN over EXISTS (SELECT ..FROM).
As in, use BulkFetchJoinHandler.java instead of BulkFetchExistsHandler.java

@andyjefferson
Copy link
Member

Actually not totally correct; the OP is talking about loading objects into memory, NOT the actual SQL invoked. No demonstration of this was presented.

A comparison of different methods for bulk-fetching is available from the authors of EclipseLink, https://java-persistence-performance.blogspot.co.uk/2010/08/batch-fetching-optimizing-object-graph.html, and while JOIN actually comes out best with a simple query, it comes out poorest when the query it is utilised with becomes complex.

Clearly if someone is particularly interested in this feature for their projects they can contribute time to provide the JOIN bulk-fetch implementation, as well as investigate what this issue is actually about.

@jonathanvx
Copy link

Unfortunately, I am a MySQL performance engineer and not a developer. Therefore, I am only able to analyse the SQL thats hitting the database and make recommendations on how to fix them. I can tell you that the vast majority of cases would benefit (sometimes greatly) from JOIN over EXISTS. In the event where you have many JOINs, this would be a more complicated issue that may be solved with better indexes, index hints... but it is more of a specialist issue at that point.
Your decision to use EXISTS means that you usually do a full table scan on the main table while using a subquery where the data in that subquery is dependant on data in the main table. This is really the worst sort of performance (at least for MySQL).
Bringing a lot of blocks into memory and then filtering them - instead of using indexes - leads to high IO and high CPU which is what the OP was complaining about. Either way, it is well worth looking at it.

@jonathanvx
Copy link

What would be great to know if there is a queryHint or some annotation where we can help datanucleus decide whether to use a JOIN or an EXISTS and then we can see for ourselves if there is a performance improvement.

@andyjefferson
Copy link
Member

andyjefferson commented Mar 11, 2017

As my comment above said, JOIN is not yet implemented; needs resource to do it and not a priority for me. So if your project/company needs it then it is there to do, and any comments relating to it should be put against the correct issue ... #171 and the query extension is documented at http://www.datanucleus.org/products/accessplatform_5_0/jdo/query.html#FetchPlan as datanucleus.rdbms.query.multivaluedFetch with values of "exists" and "none" currently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants