Add pipeline to clean docs during data stream reindex #121617
Pinging @elastic/es-data-management (Team:Data Management)
Hi @parkertimmins, I've created a changelog YAML for you.
final BytesReference pipeline = BytesReference.bytes(currentPipelineDefinition());
client.execute(
    PutPipelineTransportAction.TYPE,
    new PutPipelineRequest(
We probably ought to set the parent task id here so that it's cancellable (although I'm not 100% sure it's worth it).
{
    builder.startObject();
    {
        builder.startObject("set");
Is this going to set every `@timestamp` to 0? Shouldn't this be a script processor that checks if it exists first?
Oh never mind! This is what `override` is for.
It would be interesting to compare which performs better: a `set` with `override: false` or a `set` with an `if` condition checking for a null `@timestamp`.
It's hard to imagine that a script would be faster, but it would be interesting. We could always change that later though.
And if the script is faster, that would probably shame @joegallo into action, making the set processor faster.
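For reference, the two variants discussed in this thread could look roughly like the sketch below. It builds the processor definitions as plain Java maps rather than with the `XContentBuilder` used in the PR; the class and method names are hypothetical, and the Painless condition is one plausible way to write the null check, not necessarily what a benchmark would use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not code from this PR.
final class SetProcessorVariants {

    // Variant 1: rely on the set processor's own `override: false` option,
    // which leaves an existing non-null @timestamp untouched.
    static Map<String, Object> withOverrideFalse() {
        Map<String, Object> set = new LinkedHashMap<>();
        set.put("field", "@timestamp");
        set.put("value", 0);
        set.put("override", false);
        return Map.of("set", set);
    }

    // Variant 2: guard the processor with a Painless `if` condition,
    // so it only runs when @timestamp is missing or null.
    static Map<String, Object> withIfCondition() {
        Map<String, Object> set = new LinkedHashMap<>();
        set.put("if", "ctx['@timestamp'] == null");
        set.put("field", "@timestamp");
        set.put("value", 0);
        return Map.of("set", set);
    }
}
```

Both definitions should leave documents that already carry a timestamp unchanged; which one is cheaper per document is exactly the open question above.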
Looks pretty good, but a couple of things remain:
this generally looks good, but I'm curious why we don't use our existing infrastructure for installing pipelines?
}

@Override
protected String getOrigin() {
Since this no longer needs to run with the user's permissions, it seemed better not to require the user to have the put-pipeline privilege, and instead to create a new user with system permissions (or something like that) that is given only to this registry.
I definitely agree.
      }
    }
  ],
  "version": 1000
Since version is now handled by the index template registry, the way to keep it from overwriting a custom template is to use a higher version number.
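The version rule described here can be sketched as a simple comparison. This is an assumption about how the registry decides, not its actual code: the class and method names are hypothetical, the comparison is assumed to be strict, and an existing pipeline that declares no version at all is deliberately not modelled.

```java
// Hypothetical illustration only; not the registry's implementation.
final class PipelineVersionCheck {

    // The "version" value shown in the pipeline JSON above.
    static final int BUNDLED_VERSION = 1000;

    // Assumed rule: the bundled pipeline replaces an existing one
    // only when the existing version is strictly lower.
    static boolean wouldOverwrite(int existingVersion) {
        return existingVersion < BUNDLED_VERSION;
    }
}
```

Under that assumption, a custom pipeline declaring version 1001 would be left alone, while one declaring version 1 would be replaced.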
LGTM
LGTM, I left one comment about which origin to use
@@ -196,6 +196,7 @@ private static String maybeRewriteSingleAuthenticationHeaderForVersion(
public static final String APM_ORIGIN = "apm";
public static final String OTEL_ORIGIN = "otel";
public static final String REINDEX_DATA_STREAM_ORIGIN = "reindex_data_stream";
public static final String MIGRATE_ORIGIN = "migrate";
Let's use the existing `STACK_ORIGIN` origin; I think there's an aversion to adding too many of these, if I remember correctly.
It is possible for documents in a data stream to not have a `@timestamp` field, for example if an existing index is added to a data stream. This causes the reindex operation to fail: the destination index will already contain the `_data_stream_timestamp` value from the source mappings, so each document is checked for a `@timestamp` field.

This change adds a pipeline, tentatively called `reindex-data-stream`, which adds a `@timestamp` field with a value of `0` to destination docs if a timestamp is missing. If a user creates a pipeline with this name, and without a `version` field, the user's pipeline will be used instead of the built-in pipeline.
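
As a rough model of what the pipeline does to each destination document, the following sketch applies the same rule to a plain map. It is only an illustration: the class and method names are hypothetical, and the real work is done by the ingest `set` processor, not by code like this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: models the pipeline's effect on one document.
final class MissingTimestampDefault {

    static final String TIMESTAMP_FIELD = "@timestamp";

    // Returns a copy of the document with @timestamp set to 0 when it is absent or null;
    // documents that already have a timestamp are returned unchanged.
    static Map<String, Object> apply(Map<String, ?> source) {
        Map<String, Object> doc = new HashMap<>(source);
        if (doc.get(TIMESTAMP_FIELD) == null) {
            doc.put(TIMESTAMP_FIELD, 0);
        }
        return doc;
    }
}
```

For example, `MissingTimestampDefault.apply(Map.of("message", "hello"))` yields a document with both `message` and an `@timestamp` of `0`.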