Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

andreiIfrim · 2016-05-17T08:07:06Z

Description
For one to many relationship, after migrating to datanucleus 4 the JDO queries are having a very low performance. Datanucelus brings all sub-objects in memory and for tables with millions of entries this takes the CPU at 100%.

Model example

@PersistenceCapable(detachable` = "true")
public class Activity {
    @Persistent(valueStrategy = IdGeneratorStrategy.UUIDHEX)
    @Column(jdbcType = "VARCHAR", length = 32)
    private String encodedKey;

    @Persistent(defaultFetchGroup = "true")
    @Element(dependent = "true")
    private List<FieldChangeItem> fieldChanges = null;
.....
}

When fetching the activities without any filter all field changes are taken into the memory.
Generated SQL that is causing problems:

SELECT 
    '....model.FieldChangeItem' AS `NUCLEUS_TYPE`,
    ....
FROM
    `FIELDCHANGEITEM` `A0`
WHERE
    `A0`.`FIELDCHANGES_INTEGER_IDX` >= 0
        AND EXISTS( SELECT 
            '.....model.Activity' AS `NUCLEUS_TYPE`,
                `A0_SUB`.`ENCODEDKEY` AS `DN_APPID`
        FROM
            `ACTIVITY` `A0_SUB`
        WHERE
            `A0`.`FIELDCHANGES_ENCODEDKEY_OWN` = `A0_SUB`.`ENCODEDKEY`)
ORDER BY `NUCORDER0`

Note
The problem is present for defined fetch groups as well, not only for defaultFetchGroup = "true".

Used Workaround
Removed defaultFetchGroup = "true" and made lazy programatic fetching for this list. In this case filtered by parent key queries were generated when the field change items were fetched.

The text was updated successfully, but these errors were encountered:

andyjefferson · 2016-05-17T08:26:01Z

You mean BULK FETCH, that was introduced in DN 4.x, generates SQL to attempt to load objects now, whereas in earlier versions it simply ignored the users fetch plan and you had lazy loading always. If you think the SQL could/should be improved then you can kindly go to the code and contribute improvements. Here is the starting point
https://github.com/datanucleus/datanucleus-rdbms/blob/master/src/main/java/org/datanucleus/store/rdbms/query/BulkFetchExistsHelper.java

andreiIfrim · 2016-05-18T07:47:48Z

Thank you for your fast response Andy. I will have a look on how this can be done better!

andyjefferson · 2017-03-07T13:36:21Z

No comment on how this can be improved, nor a test that demonstrates the problem, so closing. Feel free to comment here and it could be reopened, or raise a new issue with some specific improvement and pull request that provides it.

jonathanvx · 2017-03-11T12:49:06Z

The way to fix this is choosing JOIN over EXISTS (SELECT ..FROM).
As in, use BulkFetchJoinHandler.java instead of BulkFetchExistsHandler.java

andyjefferson · 2017-03-11T13:18:58Z

Actually not totally correct; the OP is talking about loading objects into memory, NOT the actual SQL invoked. No demonstration of this was presented.

A comparison of different methods for bulk-fetching is available from the authors of EclipseLink, https://java-persistence-performance.blogspot.co.uk/2010/08/batch-fetching-optimizing-object-graph.html, and while JOIN actually comes out best with a simple query, it comes out poorest when the query it is utilised with becomes complex.

Clearly if someone is particularly interested in this feature for their projects they can contribute time to provide the JOIN bulk-fetch implementation, as well as investigate what this issue is actually about.

jonathanvx · 2017-03-11T16:49:39Z

Unfortunately, I am a MySQL performance engineer and not a developer. Therefore, I am only able to analyse the SQL thats hitting the database and make recommendations on how to fix them. I can tell you that the vast majority of cases would benefit (sometimes greatly) from JOIN over EXISTS. In the event where you have many JOINs, this would be a more complicated issue that may be solved with better indexes, index hints... but it is more of a specialist issue at that point.
Your decision to use EXISTS means that you usually do a full table scan on the main table while using a subquery where the data in that subquery is dependant on data in the main table. This is really the worst sort of performance (at least for MySQL).
Bringing a lot of blocks into memory and then filtering them - instead of using indexes - leads to high IO and high CPU which is what the OP was complaining about. Either way, it is well worth looking at it.

jonathanvx · 2017-03-11T17:26:09Z

What would be great to know if there is a queryHint or some annotation where we can help datanucleus decide whether to use a JOIN or an EXISTS and then we can see for ourselves if there is a performance improvement.

andyjefferson · 2017-03-11T17:45:22Z

As my comment above said, JOIN is not yet implemented; needs resource to do it and not a priority for me. So if your project/company needs it then it is there to do, and any comments relating to it should be put against the correct issue ... #171 and the query extension is documented at http://www.datanucleus.org/products/accessplatform_5_0/jdo/query.html#FetchPlan as datanucleus.rdbms.query.multivaluedFetch with values of "exists" and "none" currently.

andyjefferson changed the title ~~Low performance SQL are generated for Objects that are having lists of other objects~~ Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects May 17, 2016

andyjefferson added the testcase_required label Jun 22, 2016

andyjefferson closed this as completed Mar 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

andreiIfrim commented May 17, 2016 •

edited

Loading

andyjefferson commented May 17, 2016 •

edited

Loading

andreiIfrim commented May 18, 2016

andyjefferson commented Mar 7, 2017

jonathanvx commented Mar 11, 2017 •

edited

Loading

andyjefferson commented Mar 11, 2017

jonathanvx commented Mar 11, 2017

jonathanvx commented Mar 11, 2017

andyjefferson commented Mar 11, 2017 •

edited

Loading

Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

Bulk Fetch : Low performance SQL are generated for Objects that are having lists of other objects #52

Comments

andreiIfrim commented May 17, 2016 • edited Loading

andyjefferson commented May 17, 2016 • edited Loading

andreiIfrim commented May 18, 2016

andyjefferson commented Mar 7, 2017

jonathanvx commented Mar 11, 2017 • edited Loading

andyjefferson commented Mar 11, 2017

jonathanvx commented Mar 11, 2017

jonathanvx commented Mar 11, 2017

andyjefferson commented Mar 11, 2017 • edited Loading

andreiIfrim commented May 17, 2016 •

edited

Loading

andyjefferson commented May 17, 2016 •

edited

Loading

jonathanvx commented Mar 11, 2017 •

edited

Loading

andyjefferson commented Mar 11, 2017 •

edited

Loading