
What to do about data handling and Elasticsearch #3

Open
portante opened this issue Oct 17, 2022 · 1 comment


@portante

    > @portante why does ElasticSearch not add something to prevent this problem?

s/ElasticSearch/Elasticsearch/

The Elasticsearch APIs offer a way for the indexer to provide a document ID to ensure the same record is not added multiple times.
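For illustration, here is a minimal sketch of how a deterministic, content-derived document ID makes indexing idempotent. The hashing scheme and the index name in the comment are assumptions for the example, not behavior the plugin currently has:

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Derive a deterministic document ID from the document's content.

    Hashing a canonical JSON serialization (sorted keys, fixed separators)
    means re-indexing the same record always yields the same ID, so
    Elasticsearch updates the existing document instead of creating a
    duplicate.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# With an explicit ID, the indexing request becomes idempotent:
#   PUT /my-index/_doc/{content_id(doc)}
# versus the auto-generated-ID form, where every retry creates a new doc:
#   POST /my-index/_doc
```

Note that this only deduplicates byte-identical documents; two runs that collect slightly different data would still index as separate documents, which is why the metadata- or engine-supplied ID discussed below may be preferable.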

Arcaflow is a general-purpose piping layer and should not force users to pick one way over another. If we do add such a change and Arcaflow grows beyond its use case, we will be inviting an increasing number of requests about what kinds of features to support and what other functionality should make it into this plugin. Such changes are invitations for feature creep and violate the single responsibility principle, which holds that a unit of code should have only one reason to change.

This does not appear to be about feature creep, but about the responsibility this plugin is assuming in the one task it is taking on: properly indexing the given document into Elasticsearch. The plugin is the plugin: its behavior is well-defined, and this has nothing to do with the nature of the Arcaflow engine, but with how the plugin is required to handle data for indexing into Elasticsearch.

Unless we define in the interface to the plugin that its input includes an identifier for the document to be indexed, thereby moving the responsibility to some other entity in the workflow, we need to create a unique identifier for the given document.

We can't have a blind indexer send data into Elasticsearch. I have debugged too many problems over the years caused by well-intentioned code spamming a set of indexes because of this kind of problem.

A possible alternative to the plugin creating the ID is to have the metadata plugin generate a unique ID from the associated collected metadata, or to have the workflow engine itself provide a UUID for the execution instance, which could either be used directly as the UUID for the data or combined with the metadata to build it.

There are very valid use cases, such as ETL jobs, that would be made impossible by this change.

Are you envisioning Arcaflow supporting ETL jobs in the future? That could invite feature creep. Some foundational elements of Arcaflow could serve as the basis for such an ETL system, but it would likely be a mistake to position Arcaflow for both use cases.

If we keep the role of the workflow engine and plugins to be collectors of the base data, treating it as immutable, then other ETLs can transform it later from the original immutable form into whatever they want and need.

Having the base operation generate a unique ID for data that is collected during a workflow operation, and considering that collected data immutable, is very powerful.

FWIW, Lucene indexing scales by leveraging immutable objects.

And ... we might want to consider NOT having an Elasticsearch plugin as part of a step for a workflow, but a plugin which sends to a message bus (Kafka, ZeroMQ, Logstash, etc.) where data transformations could take place using existing services.

On the Pbench side, we have found it to be a mistake to endow the workload running process with data transformation steps. It has created too much inflexibility.

Originally posted by @portante in #2 (comment)

@dustinblack
Member

I know this is a year stale, but this may still be a pertinent issue for both this Elasticsearch plugin and for the Opensearch plugin.

IIUC the primary concern here is that the document being indexed should always have a unique identifier. It seems straightforward for the input schema to include a pointer to a key in the document to act as the ID, and perhaps if no pointer is provided, the plugin automatically appends a new ID key with a generated UUID value.
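The scheme above could be sketched as follows; the key name `_arca_id` and the function shape are hypothetical, not part of any existing plugin schema:

```python
import uuid
from typing import Optional, Tuple

# Hypothetical fallback key appended when no ID pointer is provided.
ID_KEY_DEFAULT = "_arca_id"


def resolve_document_id(doc: dict, id_key: Optional[str] = None) -> Tuple[dict, str]:
    """Resolve the document ID per the input-schema idea sketched above.

    If `id_key` names an existing key in the document, its value is used
    as the ID and the document is returned unchanged. Otherwise a fresh
    UUID is generated and appended under ID_KEY_DEFAULT.
    """
    if id_key is not None and id_key in doc:
        return doc, str(doc[id_key])
    new_id = str(uuid.uuid4())
    out = dict(doc)  # avoid mutating the caller's document
    out[ID_KEY_DEFAULT] = new_id
    return out, new_id
```

One caveat with the auto-generated fallback: a random UUID makes the document uniquely identifiable but does not make retries idempotent, since a re-run would mint a new ID, so the pointer-to-existing-key path is the one that actually prevents duplicate indexing.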
