[Feature Request] Mapping fashion configuration for pipeline processors #13128

Closed
zane-neo opened this issue Apr 9, 2024 · 1 comment
Labels
enhancement Enhancement or improvement to existing feature or request Plugins untriaged

Comments

@zane-neo
Contributor

zane-neo commented Apr 9, 2024

Is your feature request related to a problem? Please describe

Background

OpenSearch Core currently supports field-value configuration in several processors, e.g. AppendProcessor and SetProcessor. For example:

{
  "append": {
    "field": "your_target_field",
    "value": "{{{tenure}}}"
  }
}

These processors usually have two fixed keys, field and value, which together describe applying a value to either an existing key or a new key in the document.
But in the neural search plugin we need another pattern: mapping an existing key to a new key. E.g.

"title": "title_knn"

means mapping the title field in the document to a new key, title_knn, whose value is generated by extra logic. We also need to support complex nested object configurations that map multiple fields in one document, for example:

{
    "text_embedding": {
        "model_id": "WYjkv4MBHcWxVq8Jtc8U",
        "field_map": {
            "title": "title_knn",
            "todo_list": "todo_list_knn",
            "favorites": {
                "game": "game_knn",
                "movie": "movie_knn"
            }
        }
    }
}
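To make the nested semantics concrete, here is a minimal Python sketch (not the neural-search implementation) that walks a nested field_map and lists each source path with its target path. The function name flatten_field_map is hypothetical, and the sketch assumes that a target key inside a nested object is created next to its source key, under the same parent.

```python
from typing import Any, Dict


def flatten_field_map(field_map: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten a nested field_map into {dotted source path: dotted target path}.

    Assumption: a nested target lives under the same parent object as its source.
    """
    flat: Dict[str, str] = {}
    for key, value in field_map.items():
        source = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and carry the parent path along.
            flat.update(flatten_field_map(value, prefix=f"{source}."))
        elif isinstance(value, str):
            # Leaf mapping: source key -> target key under the same parent.
            flat[source] = f"{prefix}{value}"
        else:
            raise ValueError(f"mapping for {source!r} must be a string or an object")
    return flat


field_map = {
    "title": "title_knn",
    "todo_list": "todo_list_knn",
    "favorites": {"game": "game_knn", "movie": "movie_knn"},
}
for source_path, target_path in flatten_field_map(field_map).items():
    print(f"{source_path} -> {target_path}")
```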

Problem statement

More and more processors need this multi-field mapping configuration, and the scenario usually involves data validation and extraction, which is common logic across processors. In neural search, several processors have similar data validation and extraction logic, e.g. InferenceProcessor, TextImageEmbeddingProcessor and TextChunkingProcessor. The main problems are:

  1. Validation and extraction code is similar across different processors, and even across different plugins, but it is not reused.
  2. Any enhancement to the validation and extraction logic has to be duplicated in each processor.

Describe the solution you'd like

We can support mapping configuration in OpenSearch Core so that it can be reused in different processors across different plugins. By moving the handling of text_embedding's JSON-style configuration to OpenSearch Core, we can make the validation and extraction logic reusable. Besides, we should also support dotted-style configuration to make it easier for users, e.g.:

{
  "field_map": {
    "title": "title_knn",
    "todo_list": "todo_list_knn",
    "favorites.game": "favorites.game_knn",
    "favorites.movie": "favorites.movie_knn"
  }
}
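One way to support the dotted style is to normalize it into the nested form before validation, so a single code path handles both. The Python sketch below illustrates that idea only; expand_dotted_field_map is a hypothetical name, and requiring the source and target to share the same parent path is an assumption of this sketch, not something the proposal specifies.

```python
from typing import Any, Dict


def expand_dotted_field_map(dotted: Dict[str, str]) -> Dict[str, Any]:
    """Convert {"a.b": "a.b_knn"} style entries into the nested form {"a": {"b": "b_knn"}}."""
    nested: Dict[str, Any] = {}
    for source, target in dotted.items():
        source_parts = source.split(".")
        target_parts = target.split(".")
        # Assumption of this sketch: the target sits next to the source.
        if source_parts[:-1] != target_parts[:-1]:
            raise ValueError(f"{source!r} and {target!r} must share the same parent path")
        node = nested
        for part in source_parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # A parent segment was already mapped as a plain field.
                raise ValueError(f"{part!r} is mapped both as a field and as an object")
        node[source_parts[-1]] = target_parts[-1]
    return nested


dotted = {
    "title": "title_knn",
    "favorites.game": "favorites.game_knn",
    "favorites.movie": "favorites.movie_knn",
}
print(expand_dotted_field_map(dotted))
```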

We can create a util class similar to ConfigurationUtils. With it, different processors and plugins can use its default methods or override them to fit their own requirements.
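As a rough illustration of the shape such a helper could take, here is a Python sketch with default validation and extraction methods plus one processor-specific override. The real helper would be Java code in OpenSearch Core; every name here (FieldMapUtils, TextOnlyFieldMapUtils, max_depth) and the nesting limit of 5 are hypothetical.

```python
from typing import Any, Dict


class FieldMapUtils:
    """Hypothetical shared helper; not an existing OpenSearch API."""

    max_depth = 5  # assumed nesting limit, chosen only for illustration

    def validate(self, field_map: Dict[str, Any], depth: int = 1) -> None:
        """Reject empty keys, empty targets, and overly deep nesting."""
        if depth > self.max_depth:
            raise ValueError(f"field_map is nested deeper than {self.max_depth} levels")
        for key, value in field_map.items():
            if not key.strip():
                raise ValueError("field_map keys must be non-empty")
            if isinstance(value, dict):
                self.validate(value, depth + 1)
            elif not isinstance(value, str) or not value.strip():
                raise ValueError(f"mapping for {key!r} must be a non-empty string or an object")

    def extract(self, document: Dict[str, Any], field_map: Dict[str, Any]) -> Dict[str, Any]:
        """Return {target key: source value}, mirroring the nesting of field_map."""
        extracted: Dict[str, Any] = {}
        for key, value in field_map.items():
            if key not in document:
                continue  # fields absent from the document are skipped
            if isinstance(value, dict):
                if not isinstance(document[key], dict):
                    raise ValueError(f"{key!r} is mapped as an object but the document value is not")
                nested = self.extract(document[key], value)
                if nested:
                    extracted[key] = nested
            else:
                extracted[value] = document[key]
        return extracted


class TextOnlyFieldMapUtils(FieldMapUtils):
    """Example override: a processor that only accepts string source values."""

    def extract(self, document: Dict[str, Any], field_map: Dict[str, Any]) -> Dict[str, Any]:
        extracted = super().extract(document, field_map)
        self._require_strings(extracted)
        return extracted

    def _require_strings(self, values: Dict[str, Any]) -> None:
        for key, value in values.items():
            if isinstance(value, dict):
                self._require_strings(value)
            elif not isinstance(value, str):
                raise ValueError(f"value for {key!r} must be a string")


field_map = {"title": "title_knn", "favorites": {"game": "game_knn", "movie": "movie_knn"}}
document = {"title": "Dune", "favorites": {"game": "chess"}, "year": 1965}

utils = TextOnlyFieldMapUtils()
utils.validate(field_map)
print(utils.extract(document, field_map))
```

A processor would call validate once when it is created and extract for every ingested document; a subclass only overrides the step where its requirements differ.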

Related component

Plugins

Describe alternatives you've considered

No response

Additional context

opensearch-project/neural-search#660

@peternied
Member

[Triage - attendees 1 2 3 4 5 6]
@zane-neo Thanks for creating this issue; however, it isn't being accepted due to not having a clear outcome - without more details, can you please rewrite the issue so it is more approachable to OpenSearch developers that are not familiar with the space. Please feel free to open a new issue after addressing the reason.
