[FEATURE] Combine text chunking and text embedding output to a single nested field #1177
Comments
@Arukris Thank you for proposing this feature. While it will increase storage usage, it could be valuable for users who want to identify which chunk matched during a search.
@Arukris It's a valid use case. I think the feature request is aimed at retrieving the original text chunk that corresponds to a matching vector embedding in search results. I'll create an RFC with a few options, including your proposed solution.
This could be mitigated by a new feature flag.
@yuye-aws Are you going to work on this feature?
I'll create an RFC, but to be honest I do not have enough bandwidth to implement them. Don't worry, I can still watch these issues.
This would be an extremely useful feature, especially for evaluating relevance on extremely long documents with multiple vectors stored in a nested field. It would be great to be able to return the text chunk corresponding to the matching vector in inner_hits.
Hi @Arukris! I think your proposal is one solution to the actual need of neural-search users, which is to retrieve the specific chunk from the nested document. On top of that, I have created a feature request: #1188. Other solutions may exist; we can discuss the pros and cons later.
Is your feature request related to a problem?
The current OpenSearch implementation makes it hard to correlate vector embeddings with their source text chunks. When documents are split into chunks and processed for vector embeddings, the chunks and the embeddings are stored in separate nested fields. This structure makes it impossible to retrieve the original text chunk that corresponds to a matching vector embedding in search results. A solution is needed to maintain the relationship between embeddings and their source chunks.
What solution would you like?
Implement a unified chunk_and_embed processor that integrates text chunking and embedding functionalities, outputting a single nested field containing both the text chunk and its corresponding vector embedding.
The combined field would follow the schema below:
PUT /testindex
{
  "mappings": {
    "properties": {
      "chunked_and_embedded": {
        "type": "nested",
        "properties": {
          "raw_chunks": {
            "type": "text"
          },
          "chunk_embedding": {
            "type": "knn_vector"
          }
        }
      }
    }
  }
}
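To illustrate how a search could use a field shaped like this, here is a minimal Python sketch that builds a nested k-NN query body with `inner_hits`, so that each hit could carry the chunk that matched. The helper name `build_chunk_query`, the sample vector, and the value of `k` are assumptions made for illustration; the field names come from the mapping above. Whether `inner_hits` returns the expected chunk still has to be verified against a real cluster, and a real `knn_vector` mapping would also need a `dimension` matching the embedding model.

```python
import json


def build_chunk_query(query_vector, k=3, path="chunked_and_embedded"):
    """Build a search body for a nested k-NN query (hypothetical helper).

    The nested query targets the combined field, and the empty
    inner_hits object asks OpenSearch to return the matching nested
    objects alongside each parent document.
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        # Vector sub-field from the proposed mapping.
                        f"{path}.chunk_embedding": {
                            "vector": list(query_vector),
                            "k": k,
                        }
                    }
                },
                # Default inner_hits settings; not yet validated.
                "inner_hits": {},
            }
        }
    }


# Sample vector is made up; a real one would come from the embedding model.
body = build_chunk_query([0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the body of a `_search` request against `testindex`.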
What alternatives have you considered?
Chunk and embed outside of OpenSearch to create a single nested field with two properties. I still need to validate whether "inner_hits": {} would work with this layout.
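As a rough sketch of this alternative, the Python below chunks a text client-side and pairs each chunk with its embedding in one nested field. Everything here is an assumption for illustration: `chunk_text` is a naive fixed-size word splitter, and `toy_embed` is a hash-based placeholder standing in for a real embedding model, so its vectors carry no semantic meaning. Only the field names follow the proposed schema.

```python
import hashlib


def chunk_text(text, max_words=5):
    """Split text into chunks of at most max_words words (naive splitter)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def toy_embed(chunk, dim=4):
    """Placeholder embedding, NOT a real model.

    Derives dim floats in [0, 1] from a SHA-256 digest so the example
    runs without any external service.
    """
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_document(text, max_words=5, dim=4):
    """Pair every chunk with its embedding in a single nested field."""
    return {
        "chunked_and_embedded": [
            {"raw_chunks": chunk, "chunk_embedding": toy_embed(chunk, dim)}
            for chunk in chunk_text(text, max_words)
        ]
    }


doc = build_document(
    "OpenSearch splits long documents into chunks before embedding them"
)
for entry in doc["chunked_and_embedded"]:
    print(entry["raw_chunks"], len(entry["chunk_embedding"]))
```

Because each nested object holds both the text and its vector, a nested hit on `chunk_embedding` would sit next to the `raw_chunks` value it was computed from.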
Do you have any additional context?
We could also look into leveraging this feature for semantic highlighting in OpenSearch: highlighting the right chunk when there is a semantic match.