
What to do about data handling and Elasticsearch #3

Open
portante opened this issue Oct 17, 2022 · 1 comment


@portante

    > @portante why does ElasticSearch not add something to prevent this problem?

s/ElasticSearch/Elasticsearch/

The Elasticsearch APIs offer a way for the indexer to provide a document ID to ensure the same record is not added multiple times.
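For illustration, here is a minimal sketch of how a deterministic, content-derived document ID makes indexing idempotent. The hashing scheme and the index name in the comment are assumptions for the example, not behavior the plugin currently has:

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Derive a deterministic document ID from the document's content.

    Hashing a canonical JSON serialization (sorted keys, fixed separators)
    means re-indexing the same record always yields the same ID, so
    Elasticsearch updates the existing document instead of creating a
    duplicate.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# With an explicit ID, the indexing request becomes idempotent:
#   PUT /my-index/_doc/{content_id(doc)}
# versus the auto-generated-ID form, where every retry creates a new doc:
#   POST /my-index/_doc
```

Note that this only deduplicates byte-identical documents; two runs that collect slightly different data would still index as separate documents, which is why the metadata- or engine-supplied ID discussed below may be preferable.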

Arcaflow is a general-purpose piping layer and should not force users to pick one way over another. If we do add such a change and Arcaflow grows beyond its use case, we will be inviting an increasing number of requests about what kinds of features to support and what other functionality should make it into this plugin. Such changes are invitations for feature creep and violate the single responsibility principle, which holds that a unit of code should have only one reason to change.

This does not appear to be about feature creep, but about the responsibility this plugin is assuming in the one task it is taking on: properly indexing the given document into Elasticsearch. The plugin is the plugin: its behavior is well-defined, and this has nothing to do with the nature of the Arcaflow engine, but with how the plugin is required to handle data for indexing into Elasticsearch.

Unless we define in the interface to the plugin that its input includes an identifier for the document to be indexed, thereby moving the responsibility to some other entity in the workflow, we need to create a unique identifier for the given document.

We can't have a blind indexer send data into Elasticsearch. I have debugged too many problems over the years caused by well-intentioned code spamming a set of indexes because of this kind of problem.

A possible alternative to the plugin creating the ID is to have the metadata plugin generate a unique ID from the associated collected metadata, or to have the workflow engine itself provide a UUID for the execution instance, which could either be used directly as the UUID for the data or combined with the metadata to build it.

There are very valid use cases, such as ETL jobs, that would be made impossible by this change.

Are you envisioning Arcaflow supporting ETL jobs in the future? That could invite feature creep. Some foundational elements of Arcaflow could serve as the basis for such an ETL system, but it would likely be a mistake to position Arcaflow for both use cases.

If we keep the role of the workflow engine and plugins to be collectors of the base data, treating it as immutable, then other ETLs can transform it later from the original immutable form into whatever they want and need.

Having the base operation generate a unique ID for data that is collected during a workflow operation, and considering that collected data immutable, is very powerful.

FWIW, Lucene indexing scales by leveraging immutable objects.

And ... we might want to consider NOT having an Elasticsearch plugin as part of a step for a workflow, but a plugin which sends to a message bus (Kafka, ZeroMQ, Logstash, etc.) where data transformations could take place using existing services.

On the Pbench side, we have found it to be a mistake to endow the workload running process with data transformation steps. It has created too much inflexibility.

Originally posted by @portante in #2 (comment)

@dustinblack
Member

I know this is a year stale, but this may still be a pertinent issue for both this Elasticsearch plugin and for the Opensearch plugin.

IIUC the primary concern here is that the document being indexed should always have a unique identifier. It seems straightforward for the input schema to include a pointer to a key in the document to act as the ID, and perhaps if no pointer is provided, the plugin automatically appends a new ID key with a generated UUID value.
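The scheme above could be sketched as follows; the key name `_arca_id` and the function shape are hypothetical, not part of any existing plugin schema:

```python
import uuid
from typing import Optional, Tuple

# Hypothetical fallback key appended when no ID pointer is provided.
ID_KEY_DEFAULT = "_arca_id"


def resolve_document_id(doc: dict, id_key: Optional[str] = None) -> Tuple[dict, str]:
    """Resolve the document ID per the input-schema idea sketched above.

    If `id_key` names an existing key in the document, its value is used
    as the ID and the document is returned unchanged. Otherwise a fresh
    UUID is generated and appended under ID_KEY_DEFAULT.
    """
    if id_key is not None and id_key in doc:
        return doc, str(doc[id_key])
    new_id = str(uuid.uuid4())
    out = dict(doc)  # avoid mutating the caller's document
    out[ID_KEY_DEFAULT] = new_id
    return out, new_id
```

One caveat with the auto-generated fallback: a random UUID makes the document uniquely identifiable but does not make retries idempotent, since a re-run would mint a new ID, so the pointer-to-existing-key path is the one that actually prevents duplicate indexing.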
