Conditionally use sequential stored field reader in LuceneSyntheticSourceChangesSnapshot #121636
Conversation
this.storedFieldLoader = StoredFieldLoader.create(false, storedFields);
String codec = EngineConfig.INDEX_CODEC_SETTING.get(mapperService.getIndexSettings().getSettings());
boolean shouldForceSequentialReader = CodecService.BEST_COMPRESSION_CODEC.equals(codec);
this.storedFieldLoader = StoredFieldLoader.create(false, storedFields, shouldForceSequentialReader);
Maybe we should always force a sequential reader. I think no matter the stored field format, it has benefits given how LuceneSyntheticSourceChangesSnapshot accesses stored fields.
I am not entirely sure. Doesn't this depend on how jumpy the data is (which with index sorting could be very jumpy), and on whether there is more than one stored field per doc (not sure that is always true)?
Right, this is why I then changed the logic to not look at the index.codec setting anymore.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
@martijnvg looks like both links point to the same benchmark?
@tlrx Oops, I've updated the link.
(I updated the description with more information)
This feels a bit counterintuitive. Is it because the CCR buffer is big enough that reads end up being sequential? Also, is index sorting applied to the rally track? I'm trying to understand why this helps, since we said index sorting would make sequential reads worse, yet this change applies it all the time.
Yes, this is based on the elastic/logs track, which sets index.mode to logsdb (which means index sort fields default to host.name and @timestamp).
Reader is a misleading word in the title. This is just using a stored field reader implementation that decompresses an entire block at a time (this delegates to …).
My comment was more that it's beneficial for a specific layout, where the block of doc ids to retrieve is dense, but here we apply it all the time. Performance in this mode might degrade when adding more hosts or when more diverse ingestion happens?
For the elastic/logs track, some data streams have 0 or 1 host, some have a few and some up to 50 hosts iirc.
Right, if the buffer is small then maybe sequential stored field reader is the right choice.
You mean the docid gap between requested seqno? I will look into this.
Yes, the non-synthetic implementation selectively uses a sequential stored field reader based on the range of requested docids. Like you said, we can do that here too and be a little bit less strict. We can in the …
//
// A sequential reader decompresses a block eagerly, so that increasing adjacent doc ids can access stored fields without
// decompressing on each StoredFields#document(docId) invocation. The only downside is that the last few operations in the requested
// seq_no range may sit at the beginning of a block, which means stored fields for many docs get decompressed but aren't used.
Is this true also for zstd? Or rather would that always decompress the entire thing?
I think it is true for both zstd and deflate, since both decompress an entire block.
I'm printing the average docid gap in …
One aspect of investigating the performance difference between stored and synthetic recovery source that isn't included in this PR history is that ccr replication is much slower with zstd compressed stored fields compared to the previous default for best compression (deflate) (see results). The deflate stored field decompressor can skip decompressing unneeded blocks, whereas the zstd decompressor can't. Given that all docids are visited with ccr replication, a lot of blocks get repeatedly decompressed. This is much heavier with the current default stored field codec for logsdb (zstd). This PR tries to address this at the …
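To make that repeated-decompression cost concrete, here is a self-contained toy model (class names, block size, and counters are made up for illustration; this is not Lucene's or Elasticsearch's actual code) contrasting a per-document reader with a sequential reader that keeps the last decompressed block around:

```java
// Toy model only: stored fields live in compressed blocks of N docs. A naive reader
// decompresses the containing block on every lookup; a "sequential" reader caches the
// last decompressed block, so monotonically increasing doc ids that share a block
// decompress it only once.
final class BlockedStoredFieldsModel {
    static final int DOCS_PER_BLOCK = 128; // illustrative block size

    int decompressions = 0;
    private int cachedBlock = -1;
    private final boolean cacheLastBlock;

    BlockedStoredFieldsModel(boolean cacheLastBlock) {
        this.cacheLastBlock = cacheLastBlock;
    }

    void document(int docId) {
        int block = docId / DOCS_PER_BLOCK;
        if (cacheLastBlock && block == cachedBlock) {
            return; // block already decompressed, no extra work
        }
        decompressions++; // simulate decompressing the whole block
        cachedBlock = block;
    }

    public static void main(String[] args) {
        BlockedStoredFieldsModel naive = new BlockedStoredFieldsModel(false);
        BlockedStoredFieldsModel sequential = new BlockedStoredFieldsModel(true);
        // dense, monotonically increasing doc ids, like a shard changes request
        for (int docId = 0; docId < 1000; docId++) {
            naive.document(docId);
            sequential.document(docId);
        }
        // naive: 1000 block decompressions; sequential: 8 (one per 128-doc block)
        System.out.println("naive: " + naive.decompressions + ", sequential: " + sequential.decompressions);
    }
}
```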
I noticed that many shard changes requests end up querying for monotonically increasing docids. So I did another iteration of this change and now the sequential reader only gets used if docids are dense. This is similar to the implementation used for stored recovery source. To check the effect of this change I ran two benchmarks with the new elastic/logs logsdb with ccr track, the first using stored recovery source (baseline in the linked dashboard) and the second with synthetic recovery source (contender in the linked dashboard): https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/jbtVP
The results between stored and synthetic recovery source now look almost the same, which I think makes sense, since both implementations now use the sequential reader about equally often.
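Expressed as code, the heuristic is roughly the following; this is a sketch of the idea (the threshold and names are illustrative), not the exact condition in this PR or in the stored recovery source implementation:

```java
final class SequentialReaderHeuristic {
    // Use a sequential stored field reader only when the requested doc ids form a
    // dense, monotonically increasing run (minimum batch size is a made-up threshold).
    static boolean useSequentialReader(int[] sortedDocIds) {
        if (sortedDocIds.length < 10) {
            return false;
        }
        int range = sortedDocIds[sortedDocIds.length - 1] - sortedDocIds[0] + 1;
        return range == sortedDocIds.length; // no gaps: every doc id in the range is requested
    }
}
```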
That's very interesting! In this benchmark, source recovery is fast because it primarily relies on the sequential reader. However, in #114618, we mentioned that this scenario should be quite rare, so I’m curious, what changed? That said, it's reassuring to see that synthetic source recovery can be just as fast, even when using the sequential reader.
int[] nextDocIdArray = nextDocIds.toArray();
leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, nextDocIdArray);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), nextDocIdArray);
Another side effect of providing the array of document IDs is that some field loaders may choose to load their values eagerly. I don't see this as a problem, but I wanted to point out that we would lose this behavior if we implement the TODO above.
Agreed. I will update the TODO to include that perspective.
Note that some doc values loaders already apply this strategy when doc ids are provided and there is a single value per field:

Line 55 in b6facb2
* The singleton optimization is mostly about looking up ordinals
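As a rough illustration of that strategy, here is a hedged sketch (hypothetical class and method names, not the actual Elasticsearch loader) of a leaf loader that eagerly reads a singleton numeric doc values field for the doc ids it was given up front:

```java
// Hypothetical sketch, not the actual Elasticsearch field loader: when the sorted
// doc ids are known up front, a singleton doc values field can be read eagerly into
// arrays, so later per-document access is a plain array lookup.
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

final class EagerSingletonNumericLoader {
    private final long[] values;
    private final boolean[] hasValue;

    EagerSingletonNumericLoader(LeafReader reader, String field, int[] sortedDocIds) throws IOException {
        values = new long[sortedDocIds.length];
        hasValue = new boolean[sortedDocIds.length];
        NumericDocValues dv = reader.getNumericDocValues(field);
        for (int i = 0; i < sortedDocIds.length; i++) {
            // advanceExact requires non-decreasing targets, which the sorted doc ids guarantee
            if (dv != null && dv.advanceExact(sortedDocIds[i])) {
                hasValue[i] = true;
                values[i] = dv.longValue();
            }
        }
    }

    boolean hasValueAt(int index) {
        return hasValue[index];
    }

    long valueAt(int index) {
        return values[index];
    }
}
```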
I think that the recently indexed logs often align well with the index sorting, and then fetching recent operations by seqno ends up with dense, monotonically increasing docids. Some data streams have one or no unique host name, while others have up to 50 or something like that. I suspect that if host.name had higher cardinality then we couldn't use a sequential stored reader as often. But this then would also apply for …
Agreed, that is a nice observation.
LGTM. Thanks Martijn!
@@ -9,6 +9,8 @@

package org.elasticsearch.index.engine;

import com.carrotsearch.hppc.IntArrayList;
Did you intend to use this namespace (seems odd since it comes from a testing tool)?
I think hppc may have originated from the randomized testing framework that both Elasticsearch and Lucene use today. However, it is currently a standalone high-performance primitive collections library: https://github.com/carrotsearch/hppc, which does have other production usage in Elasticsearch.
What about using the Lucene one: package org.apache.lucene.internal.hppc
The Lucene core library only forked a subset of the hppc primitive collections library, and IntArrayList is included. But the javadocs say it was forked from version 0.10.0, while Elasticsearch uses version 0.8.1 of that library. I see most Elasticsearch usages of hppc use the dependency directly.
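For what it's worth, either variant would work for the usage in this PR, since both expose the same basic API; a minimal sketch (not from the PR) of the pattern used here:

```java
// Minimal sketch (not PR code): collect ints and convert to an array. Only the
// import would change if the Lucene fork of hppc were used instead.
import com.carrotsearch.hppc.IntArrayList;
// import org.apache.lucene.internal.hppc.IntArrayList; // Lucene's fork (from hppc 0.10.0)

final class IntArrayListExample {
    static int[] range(int from, int toExclusive) {
        IntArrayList ids = new IntArrayList();
        for (int i = from; i < toExclusive; i++) {
            ids.add(i);
        }
        return ids.toArray();
    }
}
```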
LGTM, I only left a question
@@ -191,8 +193,25 @@ private Translog.Operation[] loadDocuments(List<SearchRecord> documentRecords) t
    maxDoc = leafReaderContext.reader().maxDoc();
} while (docRecord.docID() >= docBase + maxDoc);

leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, null);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), null);
// TODO: instead of building an array, consider just checking whether doc ids are dense.
There is some knowledge in the PR description and comments that deserves to be captured in the code as a comment, explaining why we always provide the doc ids set.
// TODO: instead of building an array, consider just checking whether doc ids are dense.
// Note, field loaders would then lose the ability to optionally load values eagerly.
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
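For context, based on the surrounding diff and the explanation further down (doc ids at or beyond docBase + maxDoc belong to the next leaf), the truncated loop above plausibly continues along these lines; this is a reconstruction for readability, not the verbatim PR code:

```java
// Reconstruction (not the verbatim PR diff): gather the doc ids of all remaining
// records that fall inside the current leaf reader, stopping at the first record
// that belongs to a following leaf.
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
    SearchRecord nextRecord = documentRecords.get(j);
    if (nextRecord.docID() >= docBase + maxDoc) {
        break; // this record belongs to the next leaf reader
    }
    // assumption: docID() is index-wide, so subtract docBase for the leaf-relative doc id
    nextDocIds.add(nextRecord.docID() - docBase);
}
int[] nextDocIdArray = nextDocIds.toArray();
```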
As far as I understand, we're not increasing the complexity of the method by iterating on documentRecords again here (as we already iterate on documentRecords in the outer loop), because we only compute nextDocIds for one leaf reader. Can you confirm?
Yes, we only compute the docids for the current leaf reader. If a docid is higher than or equal to docBase + maxDoc, then the current document record belongs to the next leaf reader.
Improve LuceneSyntheticSourceChangesSnapshot by using a sequential stored field reader when docids are dense. This is done by computing up front which docids recovery source needs to be synthesized for. If the requested docids are dense and monotonically increasing, a sequential stored field reader is used, which provides recovery source for many documents without repeatedly decompressing the same block of stored fields.
Without a change like this, synthetic recovery source is ~2 times slower compared to stored recovery source. See this dashboard that shows total read time in the follower cluster broken down by shard: https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/pFChT
Baseline is stored recovery source and contender is synthetic recovery source. The benchmark is performed using a new elastic/logs challenge (elastic/rally-tracks#734), which configures auto-following of logs-* into the local cluster.
After investigation, the difference in operation read time between baseline and contender turned out to be caused by decompressing stored fields over and over again:
(This happened because, for each operation, a stored field block gets decompressed but the stored fields of only one document are read. The next operation is very likely to decompress the same stored field block again.)
The same benchmark with this change:
https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/kVgM2
With the change, synthetic recovery source is more than 2 times faster than it was without it. This matches what we observed in earlier ad-hoc / micro benchmarks.
Labelling this as a non-issue. This is a performance bug, but synthetic recovery source hasn't been released yet.