
[Python] Remarks on result performance (eager vs lazy loading) #430

Merged 2 commits on Jun 13, 2024
154 changes: 154 additions & 0 deletions python-manual/modules/ROOT/pages/performance.adoc
@@ -60,6 +60,160 @@ for i in range(1000):
----


[[lazy-eager-loading]]
== Don't fetch large result sets all at once

When submitting queries that may result in a lot of records, don't retrieve them all at once.
The Neo4j server can retrieve records in batches, and the driver can receive one batch and _wait_ until it has been processed by the application before receiving another batch from the server.
Lazy-loading a result spreads out network traffic and memory usage.

For convenience, xref:query-simple.adoc[`.execute_query()`] always retrieves all result records at once (this is what the `Eager` in `EagerResult` stands for).
> **Member:** Well technically (off to a great start 🙃), you can get `.execute_query` to discard records instead of streaming them.

> **Contributor Author:** I assume you mean with custom transformers?

> **Member:** Yep, exactly.
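As the reviewers note, `.execute_query()` can avoid materializing records by passing a custom `result_transformer_`, which receives the live `Result` inside the transaction. A minimal sketch (the query and the `count_records` transformer are illustrative, not part of the surrounding example):

```python
# Sketch: a result transformer that streams records without retaining them.
def count_records(result):
    # Iterating consumes records one at a time; none are kept afterwards,
    # so the full result set is never held in memory.
    return sum(1 for _ in result)

# Usage with the driver (URI and AUTH are placeholders):
# import neo4j
# with neo4j.GraphDatabase.driver(URI, auth=AUTH) as driver:
#     count = driver.execute_query(
#         "UNWIND range(1, 250) AS n RETURN n",
#         database_="neo4j",
#         result_transformer_=count_records,
#     )
```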

To lazy-load a result, use xref:transactions.adoc#managed-transactions[`.execute_read/write()`] (or other forms of manually-handled xref:transactions.adoc[transactions]) and do *not* cast the `Result` object to `list` when processing it; iterate over it instead.
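The lazy pattern reduces to "iterate, don't collect". A minimal sketch of that shape, separated from the driver so the iteration logic stands on its own (names and query are illustrative):

```python
# Sketch: process records one at a time instead of calling list(result).
def process_one_by_one(result, handle):
    # `result` can be a neo4j.Result or any iterable of records.
    # Iterating pulls records on demand, so with a real Result only
    # the current batch is held by the driver at any moment.
    count = 0
    for record in result:
        handle(record)
        count += 1
    return count

# With the driver (assumes an open `driver`; query is illustrative):
# with driver.session(database="neo4j") as session:
#     session.execute_read(
#         lambda tx: process_one_by_one(
#             tx.run("MATCH (n) RETURN n.name AS name"), print
#         )
#     )
```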

.Comparison between eager and lazy loading
====
Consider a query that returns 250 records, and assume the driver's link:https://neo4j.com/docs/api/python-driver/current/api.html#fetch-size-ref[batch size] is set to 100 (the default is 1000).
[cols="1a,1a", options="header"]
|===
|Eager loading
|Lazy loading
|
- The server has to read all 250 records from the storage before it can send even the first one to the driver (i.e. it takes more time for the client to receive the first record).
> **Member:** Really? I expect the server to send the records in batches of the configured batch size.

> **Member:** Aha. I just kept reading. Eager loading also assumes you don't touch the batch size.

- Before any record is available to the application, the driver has to receive all 250 records.
- The client has to hold in memory all 250 records.
|
- The server reads the first 100 records and sends them to the driver.
- The application can process records as soon as the first batch is transferred.
- When the first batch has been processed, the server reads another batch and delivers it to the driver.
Further records are delivered in further batches.
- Waiting time and resource consumption (both client- and server-side) for the remaining records is deferred to when the application requests more records.
- Resource consumption is bounded by at most 100 records.
|===
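The table's memory claim can be sanity-checked with plain Python, with no driver or server involved; the record counts and batch sizes below are illustrative. Consuming batches one at a time keeps the peak near one batch, while materializing everything first holds all records at once:

```python
import tracemalloc

def batches(n, size):
    # Yield n records in lists of `size`, mimicking the server's batching.
    for start in range(0, n, size):
        yield list(range(start, min(start + size, n)))

def peak_memory(consume, n, size):
    # Measure peak traced memory while `consume` processes the batches.
    tracemalloc.start()
    total = consume(batches(n, size))
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return total, peak

def lazy(stream):
    # Process one batch at a time; earlier batches can be freed.
    return sum(len(batch) for batch in stream)

def eager(stream):
    # Materialize all records first, like collecting an EagerResult.
    all_records = [r for batch in stream for r in batch]
    return len(all_records)
```

Running `peak_memory(lazy, 10_000, 100)` and `peak_memory(eager, 10_000, 100)` processes the same 10,000 records, but the lazy variant's peak stays around one batch while the eager variant's grows with the whole result set.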
.Time and memory comparison between eager and lazy loading
[source, python]
----
import neo4j
from time import sleep, time
import tracemalloc

URI = "<URI for Neo4j database>"
AUTH = ("<Username>", "<Password>")

# Returns 250 records, each with properties
# - `output` (an expensive computation, to slow down retrieval)
# - `dummyData` (a list of 10000 ints, about 8 KB).
slow_query = '''
UNWIND range(1, 250) AS s
RETURN reduce(s=s, x in range(1,1000000) | s + sin(toFloat(x))+cos(toFloat(x))) AS output,
       range(1, 10000) AS dummyData
'''

# Delay for each processed record
sleep_time = 0.5


def main():
    with neo4j.GraphDatabase.driver(URI, auth=AUTH) as driver:
        driver.verify_connectivity()

        start_time = time()
        log('LAZY LOADING (execute_read)')
        tracemalloc.start()
        lazy_loading(driver)
        log(f'Peak memory usage: {tracemalloc.get_traced_memory()[1]} bytes')
        tracemalloc.stop()
        log('--- %s seconds ---' % (time() - start_time))

        start_time = time()
        log('EAGER LOADING (execute_query)')
        tracemalloc.start()
        eager_loading(driver)
        log(f'Peak memory usage: {tracemalloc.get_traced_memory()[1]} bytes')
        tracemalloc.stop()
        log('--- %s seconds ---' % (time() - start_time))


def lazy_loading(driver):

    def process_records(tx):
        log('Submit query')
        result = tx.run(slow_query)
        for record in result:
            log(f'Processing record {int(record.get("output"))}')
            sleep(sleep_time)  # proxy for some expensive operation

    with driver.session(database='neo4j') as session:
        session.execute_read(process_records)


def eager_loading(driver):
    log('Submit query')
    records, _, _ = driver.execute_query(slow_query, database_='neo4j')
    for record in records:
        log(f'Processing record {int(record.get("output"))}')
        sleep(sleep_time)  # proxy for some expensive operation


def log(msg):
    print(f'[{round(time(), 2)}] {msg}')


if __name__ == '__main__':
    main()
----
.Output
[source, output, role=nocollapse]
----
[1718014246.64] LAZY LOADING (execute_read)
[1718014246.64] Submit query
[1718014256.21] Processing record 0 // <1>
[1718014256.71] Processing record 1
...
[1718014305.33] Processing record 98
[1718014305.84] Processing record 99
[1718014315.95] Processing record 100 // <2>
[1718014316.45] Processing record 101
...
[1718014394.92] Processing record 248
[1718014395.42] Processing record 249
[1718014395.92] Peak memory usage: 37694 bytes
[1718014395.92] --- 149.2824890613556 seconds ---
[1718014395.92] EAGER LOADING (execute_query)
[1718014395.92] Submit query
[1718014419.82] Processing record 0 // <3>
[1718014420.33] Processing record 1
...
[1718014468.9] Processing record 98
[1718014469.4] Processing record 99
[1718014469.9] Processing record 100 // <4>
[1718014470.4] Processing record 101
...
[1718014544.02] Processing record 248
[1718014544.52] Processing record 249
[1718014545.02] Peak memory usage: 80222 bytes // <5>
[1718014545.02] --- 149.10213112831116 seconds --- // <6>
----
<1> With lazy loading, the first record is available ~10 seconds after the query is submitted (i.e. as soon as the server has retrieved the first batch of 100 records).
<2> It takes about the same time to receive the second batch as it took for the first batch (similar for subsequent batches).
<3> With eager loading, the first record is available ~25 seconds after the query has been submitted (i.e. after the server has retrieved all 250 records).
<4> There's no delay between batches: the processing time between any two records is the same.
<5> Memory usage is larger with eager loading than with lazy loading, because the application materializes a list of 250 records (while in lazy loading it's never more than 100).
<6> The total running time is practically the same, but lazy loading spreads that out across batches whereas eager loading has one longer waiting period.
With lazy loading, the client could also stop requesting records after some condition is met, saving time and resources.
====
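The early-exit point above can be sketched without a server; the record shape and `output` property below are illustrative. With a lazily loaded `Result`, breaking out of the loop means the remaining batches are never requested:

```python
# Sketch: stop consuming a lazy result once a condition is met.
def first_matching(records, threshold):
    # `records` can be a neo4j.Result or any iterable of mappings.
    # Iteration stops at the first match, so with a lazily loaded
    # Result the remaining records are never fetched from the server.
    for record in records:
        if record["output"] > threshold:
            return record
    return None
```

Inside a managed transaction this would be `first_matching(tx.run(query), threshold)`; with an eager result, the full set would already have been transferred before the first comparison runs.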


== Route read queries to cluster readers

In a cluster, *route read queries to link:{neo4j-docs-base-uri}/operations-manual/current/clustering/introduction/#clustering-secondary-mode[secondary nodes]*. You do this by: