
[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

Closed
zane-neo opened this issue Apr 2, 2024 · 8 comments

Comments

@zane-neo
Collaborator

zane-neo commented Apr 2, 2024

Is your feature request related to a problem?

Background

Currently in neural search, we validate and extract data from the user's index documents and send it to the model for inference. We use a recursive approach to check the data structure based on the user's configuration. For example, below is a user's configuration for text_embedding; the field_map is the important part, representing the mapping between the original key and the target key (the embedding field key).

{
    "text_embedding": {
        "model_id": "WYjkv4MBHcWxVq8Jtc8U",
        "field_map": {
            "title": "title_knn",
            "favor_list": "favor_list_knn",
            "favorites": {
                "game": "game_knn",
                "movie": "movie_knn"
            }
        }
    }
}

The above configuration assumes the user's original document has the following structure:

{
    "title": "content of title", // raw string
    "favor_list": ["content of each element", "content of each element", ...], // list of string
    "favorites": {
        "game": "game content", // map type with leaf string type
        "movie": "movie content"
    }
}

We support raw strings, maps with string leaf values, lists of strings, and lists of maps with string leaf values.
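For illustration only, here is a minimal Java sketch (not the actual neural-search code) of the kind of recursive traversal described above; the class and method names are hypothetical, and the list-of-map case is omitted for brevity:

import java.util.List;
import java.util.Map;

// Sketch: walk the field_map and the source document in parallel, collecting the
// text values that should be sent to the model for embedding.
public class FieldMapTraversalSketch {

    @SuppressWarnings("unchecked")
    public static void extract(Map<String, Object> fieldMap, Map<String, Object> source, List<String> texts) {
        for (Map.Entry<String, Object> entry : fieldMap.entrySet()) {
            Object sourceValue = source.get(entry.getKey());
            if (sourceValue == null) {
                continue; // configured field not present in this document
            }
            if (entry.getValue() instanceof Map) {
                // nested mapping, e.g. "favorites": { "game": "game_knn", ... }
                extract((Map<String, Object>) entry.getValue(), (Map<String, Object>) sourceValue, texts);
            } else if (sourceValue instanceof String) {
                // raw string field, e.g. "title"
                texts.add((String) sourceValue);
            } else if (sourceValue instanceof List) {
                // list of strings, e.g. "favor_list"
                for (Object element : (List<Object>) sourceValue) {
                    texts.add(element.toString());
                }
            } else {
                throw new IllegalArgumentException("Unsupported type for field: " + entry.getKey());
            }
        }
    }
}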

Problem statement

Several processors use the same configuration to validate and extract field content: InferenceProcessor, TextImageEmbeddingProcessor, TextChunkingProcessor, etc., which causes duplicate code among these classes.
A more critical issue is that when we need to implement new features for this validation and extraction, we have to duplicate that work as well. For example, #110 requests adding dot notation support to the user's configuration, and we would need to implement it in multiple places. This is a code smell, and it's time to refactor our code.
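For context only, the sketch below shows one hypothetical way dot notation could be resolved against a nested document; DotPathSketch and resolvePath are illustrative names, not an existing API:

import java.util.Map;

// Hypothetical illustration of the "dot support" mentioned above: resolve a
// configured key such as "favorites.game" against a nested source document.
public class DotPathSketch {

    @SuppressWarnings("unchecked")
    public static Object resolvePath(Map<String, Object> source, String path) {
        Object current = source;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // path goes deeper than the document structure allows
            }
            current = ((Map<String, Object>) current).get(segment);
            if (current == null) {
                return null; // segment missing in this document
            }
        }
        return current;
    }
}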

Proposal

The proposal is to extract this code into a common place; by designing it reasonably we can reduce the code duplication, and all future enhancements will go to a single place.

We should note that not every piece can be made common, since different processors have different logic. For example, TextImageEmbeddingProcessor's and InferenceProcessor's buildxxxKeyAndOriginalValue methods support different types, so it's not easy to make that part reusable. In this case we will have to define abstract methods, and each processor will provide its own implementation; that means adding an abstract class that the different processors extend.

Some code is almost the same, e.g. TextImageEmbeddingProcessor's, InferenceProcessor's, and TextChunkingProcessor's validateEmbeddingFieldsValue/validateFieldsValue methods; they are nearly identical with only minor differences. In this case we can extract the code into a common class and a common method with slight changes, and together with a composition approach we can reduce the code duplication. A rough sketch of this shape is shown below.
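For illustration only, here is one hypothetical shape for the refactoring described above; the class and method names are assumptions, not a final design:

import java.util.List;
import java.util.Map;

// Sketch: shared validation lives in an abstract base class, while type-specific
// extraction (the buildxxxKeyAndOriginalValue logic mentioned above) stays abstract
// so each processor supplies its own implementation.
public abstract class AbstractFieldMapProcessor {

    // Shared logic: the nearly identical validateEmbeddingFieldsValue/validateFieldsValue
    // methods would be consolidated here once.
    protected void validateFieldsValue(Map<String, Object> fieldMap, Map<String, Object> source) {
        for (Map.Entry<String, Object> entry : fieldMap.entrySet()) {
            Object value = source.get(entry.getKey());
            if (value == null) {
                continue; // configured field not present in this document
            }
            if (entry.getValue() instanceof Map && !(value instanceof Map)) {
                throw new IllegalArgumentException("Field [" + entry.getKey() + "] must be an object");
            }
            if (!(value instanceof Map) && !(value instanceof String) && !(value instanceof List)) {
                throw new IllegalArgumentException("Unsupported type for field [" + entry.getKey() + "]");
            }
        }
    }

    // Processor-specific logic: each processor (text embedding, text/image embedding,
    // text chunking, ...) builds its own key-to-value mapping for inference.
    protected abstract Map<String, Object> buildKeyAndOriginalValue(Map<String, Object> fieldMap, Map<String, Object> source);
}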


@zane-neo
Collaborator Author

zane-neo commented Apr 2, 2024

This is a very early idea for refactoring the code; it doesn't mean the statement above will be the final change. Let's have more discussion under this issue, especially with the related code owners. I would like to hear your voices on this.

@navneet1v
Collaborator

@zane-neo can we have a utility exposed from opensearch-core for this? @mingshl was also doing something similar for the MLInference processor. I would recommend we come up with a common utility that we can leverage across different processors.

@zane-neo
Collaborator Author

zane-neo commented Apr 2, 2024

@zane-neo can we have a utility exposed from opensearch-core for this? @mingshl was also doing something similar for the MLInference processor. I would recommend we come up with a common utility that we can leverage across different processors.

That makes sense, since other processors might need the same support for the customer's configuration. We can consider this holistically and make it more generic across different processors.

@zane-neo zane-neo changed the title from "[FEATURE] Refactor on data validation and extraction from customer's index documents in multiple processors" to "[FEATURE] Refactor on data validation and extraction from customer's documents in several processors" on Apr 2, 2024
@zane-neo
Collaborator Author

zane-neo commented Apr 2, 2024

I checked @mingshl's issue and PR again, and it seems her proposal targets a different case. Simply put: her processor is based on a remote model connector with user-defined painless scripts, so only remote models can use it. Our case is different since we support both remote and local models and don't depend on connector configuration or painless scripts, which is a more generic case that can even be applied to non-inference processors. @mingshl please correct me if I have misunderstood anything.

@mingshl

mingshl commented Apr 3, 2024

@zane-neo dot path notation for nested objects will be supported in ML inference processors.

We support remote models first, but future development will include local models. The reason is that the input datasets for local models are planned to be unified with those of remote models, which will make it convenient to support many features, including ML inference processors.

@mingshl

mingshl commented Apr 3, 2024

I checked @mingshl's issue and PR again, and it seems her proposal targets a different case. Simply put: her processor is based on a remote model connector with user-defined painless scripts, so only remote models can use it. Our case is different since we support both remote and local models and don't depend on connector configuration or painless scripts, which is a more generic case that can even be applied to non-inference processors. @mingshl please correct me if I have misunderstood anything.

I think there is some divergence here. In ML inference processors, the main responsibility is to call inference and add the results to the documents or search responses. The only required field is model_id, so the use cases have to use models for inference. And the ML inference processor itself doesn't require writing painless scripts.

@zane-neo
Collaborator Author

zane-neo commented Apr 7, 2024

I checked @mingshl's issue and PR again, and it seems her proposal targets a different case. Simply put: her processor is based on a remote model connector with user-defined painless scripts, so only remote models can use it. Our case is different since we support both remote and local models and don't depend on connector configuration or painless scripts, which is a more generic case that can even be applied to non-inference processors. @mingshl please correct me if I have misunderstood anything.

I think there is some divergence here. In ML inference processors, the main responsibility is to call inference and add the results to the documents or search responses. The only required field is model_id, so the use cases have to use models for inference. And the ML inference processor itself doesn't require writing painless scripts.

Yeah, painless scripts are part of the connector, which conforms to the current remote inference implementation. The main point here is that the ML inference processor is based on a connector, which is an ml-commons-specific concept, so this part isn't in scope if we want a generic data structure parser in OpenSearch core.

@naveentatikonda
Member

Closing this issue. Feel free to reopen if there are any concerns.
