
[FEATURE] Combine text chunking and text embedding output into a single nested field #1177

Open
Arukris opened this issue Feb 7, 2025 · 7 comments


@Arukris

Arukris commented Feb 7, 2025

Is your feature request related to a problem?

The current OpenSearch implementation has a limitation in correlating vector embeddings with their source text chunks. When documents are split into chunks and processed for vector embeddings, these elements are stored in separate nested fields. This structure makes it impossible to retrieve the original text chunk that corresponds to a matching vector embedding in search results. A solution is needed to maintain the relationship between embeddings and their source chunks.

What solution would you like?

Implement a unified chunk_and_embed processor that integrates text chunking and embedding functionalities, outputting a single nested field containing both the text chunk and its corresponding vector embedding.

The combined field would follow this schema:

```json
PUT /testindex
{
  "mappings": {
    "properties": {
      "chunked_and_embedded": {
        "type": "nested",
        "properties": {
          "raw_chunks": {
            "type": "text"
          },
          "chunk_embedding": {
            "type": "knn_vector"
          }
        }
      }
    }
  }
}
```
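To make the proposal concrete, a hypothetical ingest pipeline using the proposed processor could look like the sketch below. Note that `chunk_and_embed` and all of its parameters (`field_map`, `algorithm`, `token_limit`) are illustrative only, loosely modeled on the existing `text_chunking` and `text_embedding` processors; `my-embedding-model-id` is a placeholder:

```json
PUT /_ingest/pipeline/chunk-embed-pipeline
{
  "description": "Sketch of the proposed combined chunk-and-embed processor (not an existing API)",
  "processors": [
    {
      "chunk_and_embed": {
        "model_id": "my-embedding-model-id",
        "field_map": {
          "body_text": "chunked_and_embedded"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384
          }
        }
      }
    }
  ]
}
```

Each chunk produced from `body_text` would become one nested object in `chunked_and_embedded`, carrying both `raw_chunks` and `chunk_embedding`.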

What alternatives have you considered?

Chunk and embed outside of OpenSearch to create a single nested field with two properties. I still need to validate whether `"inner_hits": {}` could work.
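For the `inner_hits` idea, a nested k-NN query over the schema above might look like the following sketch (the index and field names are from the proposed mapping; the `knn` query syntax is from the OpenSearch k-NN plugin; the vector values and `k` are illustrative):

```json
GET /testindex/_search
{
  "query": {
    "nested": {
      "path": "chunked_and_embedded",
      "query": {
        "knn": {
          "chunked_and_embedded.chunk_embedding": {
            "vector": [0.1, 0.2, 0.3],
            "k": 3
          }
        }
      },
      "inner_hits": {
        "_source": ["chunked_and_embedded.raw_chunks"]
      }
    }
  }
}
```

If chunks and embeddings live in the same nested object, `inner_hits` should return exactly the `raw_chunks` text of the matching nested documents, which is the correlation this issue asks for.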

Do you have any additional context?

We could also look into leveraging this feature for semantic highlighting in OpenSearch: highlight the right chunk when there is a semantic match.

@heemin32
Collaborator

heemin32 commented Feb 7, 2025

@Arukris Thank you for proposing this feature. While it will increase storage usage, it could be valuable for users who want to identify which chunk is matched during a search.

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

@Arukris It's a valid use case. I think the feature request is targeted toward retrieving the original text chunk that corresponds to a matching vector embedding in search results. I'll create an RFC with a few options, including your proposed solution.

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

> While it will increase storage usage

This could be mitigated by a new feature flag.

@heemin32
Collaborator

heemin32 commented Feb 8, 2025

@yuye-aws Are you going to work on this feature?

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

> @yuye-aws Are you going to work on this feature?

I'll create an RFC, but to be honest I don't have enough bandwidth to implement it. Don't worry, I can still watch these issues.

@cxclark

cxclark commented Feb 11, 2025

This would be an extremely useful feature, especially for evaluating relevance in extremely long documents with multiple vectors stored in a nested field. It would be great to be able to return the text chunk corresponding to the matching vector in `inner_hits`.

@yuye-aws
Member

Hi @Arukris! I think your proposal is one of the possible solutions to the actual need of neural-search users, which is to retrieve the specific chunk from the nested document. On top of that, I have created a feature request: #1188. There may be other solutions; we can discuss the pros and cons later.

Projects
Status: Backlog(Hot)
Development

No branches or pull requests

4 participants