Add pipeline to clean docs during data stream reindex #121617

parkertimmins · 2025-02-04T02:13:59Z

It is possible for documents in a data stream to lack a @timestamp field, for example if an existing index is added to a data stream. This causes the reindex operation to fail, because the destination index already contains the _data_stream_timestamp value from the source mappings, so each document is checked for an @timestamp field.

This change adds a pipeline, tentatively called reindex-data-stream, which adds a @timestamp field with a value of 0 to destination docs if a timestamp is missing. If a user creates a pipeline with this name, and without a version field, the user's pipeline will be used instead of the built-in pipeline.
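As a rough illustration of the behaviour described above, the following sketch applies the same rule to a plain Java map: a document keeps its existing @timestamp, and only a document without one receives the value 0. The class and method names are hypothetical and this is not the plugin's code or an Elasticsearch API; it only models the intended effect of the pipeline.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: models the pipeline's effect on a document
// represented as a plain map. Not the plugin's code, not an Elasticsearch API.
final class TimestampDefaultSketch {
    static final String TIMESTAMP_FIELD = "@timestamp";

    // Returns a copy of the document. @timestamp is set to 0 only when the
    // field is absent or null; an existing value is left untouched.
    static Map<String, Object> withDefaultTimestamp(Map<String, Object> source) {
        Map<String, Object> doc = new HashMap<>(source);
        if (doc.get(TIMESTAMP_FIELD) == null) {
            doc.put(TIMESTAMP_FIELD, 0);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> missing = new HashMap<>();
        missing.put("message", "doc from a pre-existing index");

        Map<String, Object> present = new HashMap<>();
        present.put(TIMESTAMP_FIELD, "2025-02-04T02:13:59Z");

        System.out.println(withDefaultTimestamp(missing).get(TIMESTAMP_FIELD));
        System.out.println(withDefaultTimestamp(present).get(TIMESTAMP_FIELD));
    }
}
```

The helper copies its input rather than mutating it, which keeps the sketch easy to test; the real pipeline operates on the document being indexed.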

.../plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/ReindexDataStreamPipeline.java

elasticsearchmachine · 2025-02-04T15:06:43Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-02-04T15:07:49Z

Hi @parkertimmins, I've created a changelog YAML for you.

masseyke · 2025-02-04T16:07:37Z

.../plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/ReindexDataStreamPipeline.java

+        final BytesReference pipeline = BytesReference.bytes(currentPipelineDefinition());
+        client.execute(
+            PutPipelineTransportAction.TYPE,
+            new PutPipelineRequest(


We probably ought to set the parent task id here so that it's cancellable (although I'm not 100% sure it's worth it).

masseyke · 2025-02-04T16:09:26Z

.../plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/ReindexDataStreamPipeline.java

+                {
+                    builder.startObject();
+                    {
+                        builder.startObject("set");


Is this going to set every @timestamp to 0? Shouldn't this be a script processor that checks if it exists first?

Oh never mind! This is what override is for.

It would be interesting to compare which performs better: a set with override: false or a set with an if condition checking for a null @timestamp

It's hard to imagine that a script would be faster, but it would be interesting. We could always change that later though.

And if the script is faster, that would probably shame @joegallo into action, making the set processor faster.
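For reference, the two variants compared in this thread might look roughly like the following, written as plain Java maps that mirror the JSON of a single processor entry. This is a sketch under assumptions: the class and method names are hypothetical, the exact condition string is a guess at what the `if` variant would use, and nothing here measures performance.

```java
import java.util.Map;

// Hypothetical sketch: plain maps mirroring the JSON of the two processor
// variants discussed in the thread. Not taken from the pull request.
final class SetProcessorVariants {

    // Variant A: rely on the set processor's "override" option, so a field
    // that already has a non-null value is left alone.
    static Map<String, Object> withOverrideFalse() {
        return Map.of(
            "set",
            Map.<String, Object>of("field", "@timestamp", "value", 0, "override", false)
        );
    }

    // Variant B: guard the processor with a conditional script instead.
    // The condition text is an assumption made for this illustration.
    static Map<String, Object> withIfCondition() {
        return Map.of(
            "set",
            Map.<String, Object>of("field", "@timestamp", "value", 0, "if", "ctx['@timestamp'] == null")
        );
    }

    public static void main(String[] args) {
        System.out.println(withOverrideFalse());
        System.out.println(withIfCondition());
    }
}
```

Variant A keeps the check inside the processor itself, while variant B evaluates a script before the processor runs; which one is faster is exactly the open question raised above.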

masseyke · 2025-02-04T16:31:40Z

Looks pretty good, but a couple of things remain:

- The REINDEX_DATA_STREAM_USER needs a new cluster privilege for creating the pipeline.
- Once you merge in main, DataStreamsUpgradeIT will begin failing because now we no longer expect failures when there is no @timestamp field.

dakrone

This generally looks good, but I'm curious why we don't use our existing infrastructure for installing pipelines?

.../plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/ReindexDataStreamPipeline.java

dakrone · 2025-02-04T16:55:45Z

.../plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/ReindexDataStreamPipeline.java

+                {
+                    builder.startObject();
+                    {
+                        builder.startObject("set");


It would be interesting to compare which performs better: a set with override: false or a set with an if condition checking for a null @timestamp

parkertimmins · 2025-02-04T19:45:21Z

...ck/plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/MigrateTemplateRegistry.java

+    }
+
+    @Override
+    protected String getOrigin() {


Since this no longer needs to be run with the user's permissions, it seemed better not to require the user to have put-pipeline, and instead to make a new user with system perms (or something like that) and only give it to this registry.

I definitely agree.

parkertimmins · 2025-02-04T19:46:42Z

...est/java/org/elasticsearch/xpack/migrate/action/ReindexDatastreamIndexTransportActionIT.java

+                      }
+                    }
+                  ],
+                  "version": 1000


Since version is now handled by the index template registry, the way to keep it from overwriting a custom template is to use a higher version number.
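The version rule mentioned here can be sketched as a single comparison. This is a minimal illustration assuming the registry replaces an installed pipeline only when its own version is higher, and that a missing version counts as "replace"; both points are assumptions of the sketch, and the class and method names are hypothetical rather than taken from IndexTemplateRegistry.

```java
// Hypothetical sketch of a version-gated install decision.
// Not the IndexTemplateRegistry source.
final class PipelineVersionGate {

    // Returns true when the built-in pipeline should be written.
    // installedVersion is null when nothing is installed or the installed
    // pipeline carries no version (treating both the same is an assumption).
    static boolean shouldInstall(Integer installedVersion, int builtInVersion) {
        return installedVersion == null || installedVersion < builtInVersion;
    }

    public static void main(String[] args) {
        int builtInVersion = 1; // placeholder value for illustration

        // A custom pipeline with a high version, as in the test above, is kept.
        System.out.println(shouldInstall(1000, builtInVersion));

        // Nothing installed yet, so the built-in pipeline is written.
        System.out.println(shouldInstall(null, builtInVersion));
    }
}
```

Under that rule, giving the custom pipeline a version such as 1000 keeps it ahead of the built-in one for a long time.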

masseyke

LGTM

dakrone

LGTM, I left one comment about which origin to use

dakrone · 2025-02-04T21:14:49Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ClientHelper.java

@@ -196,6 +196,7 @@ private static String maybeRewriteSingleAuthenticationHeaderForVersion(
    public static final String APM_ORIGIN = "apm";
    public static final String OTEL_ORIGIN = "otel";
    public static final String REINDEX_DATA_STREAM_ORIGIN = "reindex_data_stream";
+    public static final String MIGRATE_ORIGIN = "migrate";


Let's use the existing STACK_ORIGIN origin. I think there's an aversion to adding too many of these, if I remember correctly.

Add data stream reindex pipeline to sanitize docs

b3e2466

parkertimmins added the :Data Management/Data streams, auto-backport, v9.0.0, v8.18.0, v8.18.1, and v9.0.1 labels Feb 4, 2025

elasticsearchmachine added the v9.1.0 label Feb 4, 2025

Merge branch 'main' into reindex-data-stream-pipeline

68d8e77

parkertimmins marked this pull request as ready for review February 4, 2025 15:06

elasticsearchmachine added the Team:Data Management label Feb 4, 2025

parkertimmins added the >bug label Feb 4, 2025

Update docs/changelog/121617.yaml

7916c28

masseyke requested review from masseyke and dakrone February 4, 2025 15:15

masseyke reviewed Feb 4, 2025

parkertimmins added 2 commits February 4, 2025 10:56

review feedback

aefea5b

Merge branch 'main' into reindex-data-stream-pipeline

369aec9

dakrone reviewed Feb 4, 2025

parkertimmins added 2 commits February 4, 2025 11:15

Now that fix is in, change upgrade test to not expect errors

0d444d9

Replace adhoc pipeline registration with IndexTemplateRegistry

c8178ba

[CI] Auto commit changes from spotless

0b26eec

masseyke approved these changes Feb 4, 2025

dakrone approved these changes Feb 4, 2025

parkertimmins added 2 commits February 4, 2025 15:46

use stack origin instead of making new migrate origin

7fc1a0f

Merge branch 'main' into reindex-data-stream-pipeline

f2356b8

parkertimmins added >non-issue and removed >bug labels Feb 4, 2025

Delete docs/changelog/121617.yaml

4d6cd6e
