Conditionally use sequential stored field reader in LuceneSyntheticSourceChangesSnapshot #121636
Conversation
this.storedFieldLoader = StoredFieldLoader.create(false, storedFields);
String codec = EngineConfig.INDEX_CODEC_SETTING.get(mapperService.getIndexSettings().getSettings());
boolean shouldForceSequentialReader = CodecService.BEST_COMPRESSION_CODEC.equals(codec);
this.storedFieldLoader = StoredFieldLoader.create(false, storedFields, shouldForceSequentialReader);
Maybe we should always force a sequential reader. I think no matter the stored field format, it has benefits given how LuceneSyntheticSourceChangesSnapshot accesses stored fields.
I am not entirely sure. Doesn't this depend on how jumpy the data is (which with index sorting could be very jumpy), and on whether there is more than one stored field per doc (not sure that is always true)?
Right, this is why I then changed the logic to not look at the index.codec setting anymore.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
@martijnvg looks like both links point to the same benchmark?
@tlrx Oops, I've updated the link.
(I updated the description with more information)
This feels a bit counterintuitive. Is it because the CCR buffer is big enough that reads end up being sequential? Also, is index sorting applied to the rally track? I'm trying to understand why this helps, since we said index sorting would make sequential reads worse, yet this change applies it all the time.
Yes, this is based on the elastic/logs track, which sets index.mode to logsdb (which means index sort fields default to host.name and @timestamp).
Reader is a misleading word in the title. This is just using a stored field reader implementation that decompresses an entire block at a time (this delegates to …).
My comment was more that it's beneficial for a specific layout, where the block of doc ids to retrieve is dense, but here we apply it all the time. Performance in this mode might degrade when adding more hosts or when more diverse ingestion happens?
For the elastic/logs track, some data streams have 0 or 1 host, some have a few and some up to 50 hosts iirc.
Right, if the buffer is small then maybe sequential stored field reader is the right choice.
You mean the docid gap between requested seqno? I will look into this.
Yes, the non-synthetic implementation selectively uses a sequential stored field reader based on the range of requested docids. Like you said, we can do that here too and be a little bit less strict. We can in the …
//
// A sequential reader decompresses a block eagerly, so that increasing adjacent doc ids can access stored fields without
// decompressing on each StoredFields#document(docId) invocation. The only downside is that the last few operations in the requested
// seq_no range may sit at the beginning of a block, which means stored fields for many docs get decompressed but aren't used.
Is this true also for zstd? Or rather would that always decompress the entire thing?
I think it is true for both zstd and deflate, since both decompress an entire block.
I'm printing the average docid gap in …
One aspect of investigating the performance difference between stored and synthetic recovery source that isn't included in this PR history is that ccr replication is much slower with zstd compressed stored fields compared to the previous default for best compression (deflate) (see results). The deflate stored field decompressor can skip decompressing unneeded blocks, whereas the zstd decompressor can't. Given that all docids are visited with ccr replication, a lot of blocks get repeatedly decompressed. This is much heavier with the current default stored field codec for logsdb (zstd). This PR tries to address this at the …
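To make that repeated-decompression cost concrete, here is a self-contained toy model (class names, block size, and counters are made up for illustration; this is not Lucene's or Elasticsearch's actual code) contrasting a per-document reader with a sequential reader that keeps the last decompressed block around:

```java
// Toy model only: stored fields live in compressed blocks of N docs. A naive reader
// decompresses the containing block on every lookup; a "sequential" reader caches the
// last decompressed block, so monotonically increasing doc ids that share a block
// decompress it only once.
final class BlockedStoredFieldsModel {
    static final int DOCS_PER_BLOCK = 128; // illustrative block size

    int decompressions = 0;
    private int cachedBlock = -1;
    private final boolean cacheLastBlock;

    BlockedStoredFieldsModel(boolean cacheLastBlock) {
        this.cacheLastBlock = cacheLastBlock;
    }

    void document(int docId) {
        int block = docId / DOCS_PER_BLOCK;
        if (cacheLastBlock && block == cachedBlock) {
            return; // block already decompressed, no extra work
        }
        decompressions++; // simulate decompressing the whole block
        cachedBlock = block;
    }

    public static void main(String[] args) {
        BlockedStoredFieldsModel naive = new BlockedStoredFieldsModel(false);
        BlockedStoredFieldsModel sequential = new BlockedStoredFieldsModel(true);
        // dense, monotonically increasing doc ids, like a shard changes request
        for (int docId = 0; docId < 1000; docId++) {
            naive.document(docId);
            sequential.document(docId);
        }
        // naive: 1000 block decompressions; sequential: 8 (one per 128-doc block)
        System.out.println("naive: " + naive.decompressions + ", sequential: " + sequential.decompressions);
    }
}
```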
I noticed that many shard changes requests end up querying for monotonically increasing docids. So I did another iteration of this change and now the sequential reader only gets used if docids are dense. This is similar to the implementation used for stored recovery source. To check the effect of this change I ran two benchmarks with the new elastic/logs logsdb with ccr track, the first using stored recovery source (baseline in the linked dashboard) and the second with synthetic recovery source (contender in the linked dashboard): https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/jbtVP
The results between stored and synthetic recovery source now look almost the same, which I think makes sense, since both implementations now use the sequential reader about equally often.
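Expressed as code, the heuristic is roughly the following; this is a sketch of the idea (the threshold and names are illustrative), not the exact condition in this PR or in the stored recovery source implementation:

```java
final class SequentialReaderHeuristic {
    // Use a sequential stored field reader only when the requested doc ids form a
    // dense, monotonically increasing run (minimum batch size is a made-up threshold).
    static boolean useSequentialReader(int[] sortedDocIds) {
        if (sortedDocIds.length < 10) {
            return false;
        }
        int range = sortedDocIds[sortedDocIds.length - 1] - sortedDocIds[0] + 1;
        return range == sortedDocIds.length; // no gaps: every doc id in the range is requested
    }
}
```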
That's very interesting! In this benchmark, source recovery is fast because it primarily relies on the sequential reader. However, in #114618, we mentioned that this scenario should be quite rare, so I’m curious, what changed? That said, it's reassuring to see that synthetic source recovery can be just as fast, even when using the sequential reader.
int[] nextDocIdArray = nextDocIds.toArray();
leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, nextDocIdArray);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), nextDocIdArray);
Another side effect of providing the array of document IDs is that some field loaders may choose to load their values eagerly. I don't see this as a problem, but I wanted to point out that we would lose this behavior if we implement the TODO above.
Agreed. I will update the TODO to include that perspective.
Note that some doc values loaders already apply this strategy when doc ids are provided and there is a single value per field:

Line 55 in b6facb2
* The singleton optimization is mostly about looking up ordinals
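As a rough illustration of that strategy, here is a hedged sketch (hypothetical class and method names, not the actual Elasticsearch loader) of a leaf loader that eagerly reads a singleton numeric doc values field for the doc ids it was given up front:

```java
// Hypothetical sketch, not the actual Elasticsearch field loader: when the sorted
// doc ids are known up front, a singleton doc values field can be read eagerly into
// arrays, so later per-document access is a plain array lookup.
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

final class EagerSingletonNumericLoader {
    private final long[] values;
    private final boolean[] hasValue;

    EagerSingletonNumericLoader(LeafReader reader, String field, int[] sortedDocIds) throws IOException {
        values = new long[sortedDocIds.length];
        hasValue = new boolean[sortedDocIds.length];
        NumericDocValues dv = reader.getNumericDocValues(field);
        for (int i = 0; i < sortedDocIds.length; i++) {
            // advanceExact requires non-decreasing targets, which the sorted doc ids guarantee
            if (dv != null && dv.advanceExact(sortedDocIds[i])) {
                hasValue[i] = true;
                values[i] = dv.longValue();
            }
        }
    }

    boolean hasValueAt(int index) {
        return hasValue[index];
    }

    long valueAt(int index) {
        return values[index];
    }
}
```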
I think that the recently indexed logs often align well with the index sorting, and then fetching recent operations by seqno ends up with dense, monotonically increasing docids. Some data streams have one or no unique host name, while others have up to 50 or something like that. I suspect that if host.name had higher cardinality then we couldn't use a sequential stored reader as often. But this then would also apply for …
Agreed, that is a nice observation.
LGTM. Thanks Martijn!
@@ -9,6 +9,8 @@

package org.elasticsearch.index.engine;

import com.carrotsearch.hppc.IntArrayList;
Did you intend to use this namespace (seems odd since it comes from a testing tool)?
I think hppc may have originated from the randomized testing framework that both Elasticsearch and Lucene use today. However, it is currently a standalone high-performance primitive collections library: https://github.com/carrotsearch/hppc, which does have other production usage in Elasticsearch.
What about using the Lucene one: package org.apache.lucene.internal.hppc
The Lucene core library only forked a subset of the hppc primitive collections library, and IntArrayList is included. But the javadocs say it was forked from version 0.10.0, while Elasticsearch uses version 0.8.1 of that library. I see most Elasticsearch usages of hppc use the dependency directly.
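For what it's worth, either variant would work for the usage in this PR, since both expose the same basic API; a minimal sketch (not from the PR) of the pattern used here:

```java
// Minimal sketch (not PR code): collect ints and convert to an array. Only the
// import would change if the Lucene fork of hppc were used instead.
import com.carrotsearch.hppc.IntArrayList;
// import org.apache.lucene.internal.hppc.IntArrayList; // Lucene's fork (from hppc 0.10.0)

final class IntArrayListExample {
    static int[] range(int from, int toExclusive) {
        IntArrayList ids = new IntArrayList();
        for (int i = from; i < toExclusive; i++) {
            ids.add(i);
        }
        return ids.toArray();
    }
}
```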
LGTM, I only left a question
@@ -191,8 +193,25 @@ private Translog.Operation[] loadDocuments(List<SearchRecord> documentRecords) t
    maxDoc = leafReaderContext.reader().maxDoc();
} while (docRecord.docID() >= docBase + maxDoc);

leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, null);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), null);
// TODO: instead of building an array, consider just checking whether doc ids are dense.
There is some knowledge in the PR description and comments that deserves to be captured in the code as a comment, explaining why we always provide the doc ids set.
// TODO: instead of building an array, consider just checking whether doc ids are dense.
// Note, field loaders would then lose the ability to optionally load values eagerly.
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
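For context, based on the surrounding diff and the explanation further down (doc ids at or beyond docBase + maxDoc belong to the next leaf), the truncated loop above plausibly continues along these lines; this is a reconstruction for readability, not the verbatim PR code:

```java
// Reconstruction (not the verbatim PR diff): gather the doc ids of all remaining
// records that fall inside the current leaf reader, stopping at the first record
// that belongs to a following leaf.
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
    SearchRecord nextRecord = documentRecords.get(j);
    if (nextRecord.docID() >= docBase + maxDoc) {
        break; // this record belongs to the next leaf reader
    }
    // assumption: docID() is index-wide, so subtract docBase for the leaf-relative doc id
    nextDocIds.add(nextRecord.docID() - docBase);
}
int[] nextDocIdArray = nextDocIds.toArray();
```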
As far as I understand, we're not increasing the complexity of the method by iterating on documentRecords again here (as we already iterate on documentRecords in the outer loop), because we only compute nextDocIds for one leaf reader. Can you confirm?
Yes, we only compute the docids for the current leaf reader. If a docid is higher than or equal to docBase + maxDoc, then the current document record belongs to the next leaf reader.
Improve LuceneSyntheticSourceChangesSnapshot by using a sequential stored field reader when docids are dense. This is done by computing up front which docids recovery source needs to be synthesized for. If the requested docids are dense and monotonically increasing, a sequential stored field reader is used, which provides recovery source for many documents without repeatedly decompressing the same block of stored fields.
Without a change like this, synthetic recovery source is ~2 times slower compared to stored recovery source. See this dashboard that shows total read time in the follower cluster broken down by shard: https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/pFChT
Baseline is stored recovery source and contender is synthetic recovery source. The benchmark is performed using a new elastic/logs challenge (elastic/rally-tracks#734), which configures auto-following of logs-* into the local cluster.
After investigation, the difference in operation read time between baseline and contender turned out to be caused by decompressing stored fields over and over again:
(This happened because, for each operation, a stored field block gets decompressed but the stored fields of only one document are read. The next operation is very likely to decompress the same stored field block again.)
The same benchmark with this change:
https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/kVgM2
With the change, synthetic recovery source is more than 2 times faster than it was without it. This matches what we observed in earlier ad-hoc / micro benchmarks.
Labelling this as a non-issue. This is a performance bug, but synthetic recovery source hasn't been released yet.