
[FEATURE] Combine text chunking and text embedding output into a single nested field #1177

Open
Arukris opened this issue Feb 7, 2025 · 7 comments


@Arukris

Arukris commented Feb 7, 2025

Is your feature request related to a problem?

The current OpenSearch implementation has a limitation in correlating vector embeddings with their source text chunks. When documents are split into chunks and processed for vector embeddings, these elements are stored in separate nested fields. This structure makes it impossible to retrieve the original text chunk that corresponds to a matching vector embedding in search results. A solution is needed to maintain the relationship between embeddings and their source chunks.

What solution would you like?

Implement a unified chunk_and_embed processor that integrates text chunking and embedding functionalities, outputting a single nested field containing both the text chunk and its corresponding vector embedding.

The combined field would follow this schema:

```json
PUT /testindex
{
  "mappings": {
    "properties": {
      "chunked_and_embedded": {
        "type": "nested",
        "properties": {
          "raw_chunks": {
            "type": "text"
          },
          "chunk_embedding": {
            "type": "knn_vector"
          }
        }
      }
    }
  }
}
```
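To make the proposal concrete, a hypothetical ingest pipeline using the proposed processor could look like the sketch below. Note that `chunk_and_embed` and all of its parameters (`field_map`, `algorithm`, `token_limit`) are illustrative only, loosely modeled on the existing `text_chunking` and `text_embedding` processors; `my-embedding-model-id` is a placeholder:

```json
PUT /_ingest/pipeline/chunk-embed-pipeline
{
  "description": "Sketch of the proposed combined chunk-and-embed processor (not an existing API)",
  "processors": [
    {
      "chunk_and_embed": {
        "model_id": "my-embedding-model-id",
        "field_map": {
          "body_text": "chunked_and_embedded"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384
          }
        }
      }
    }
  ]
}
```

Each chunk produced from `body_text` would become one nested object in `chunked_and_embedded`, carrying both `raw_chunks` and `chunk_embedding`.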

What alternatives have you considered?

Chunk and embed outside of OpenSearch to create a single nested field with two properties. I still need to validate whether `"inner_hits": {}` could work.
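For the `inner_hits` idea, a nested k-NN query over the schema above might look like the following sketch (the index and field names are from the proposed mapping; the `knn` query syntax is from the OpenSearch k-NN plugin; the vector values and `k` are illustrative):

```json
GET /testindex/_search
{
  "query": {
    "nested": {
      "path": "chunked_and_embedded",
      "query": {
        "knn": {
          "chunked_and_embedded.chunk_embedding": {
            "vector": [0.1, 0.2, 0.3],
            "k": 3
          }
        }
      },
      "inner_hits": {
        "_source": ["chunked_and_embedded.raw_chunks"]
      }
    }
  }
}
```

If chunks and embeddings live in the same nested object, `inner_hits` should return exactly the `raw_chunks` text of the matching nested documents, which is the correlation this issue asks for.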

Do you have any additional context?

We could also look into leveraging this feature for semantic highlighting in OpenSearch: highlight the right chunk when there is a semantic match.

@heemin32
Collaborator

heemin32 commented Feb 7, 2025

@Arukris Thank you for proposing this feature. While it will increase storage usage, it could be valuable for users who want to identify which chunk is matched during a search.

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

@Arukris It's a valid use case. I think the feature request is targeted toward retrieving the original text chunk that corresponds to a matching vector embedding in search results. I'll create an RFC with a few options, including your proposed solution.

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

> While it will increase storage usage

This could be mitigated by a new feature flag.

@heemin32
Collaborator

heemin32 commented Feb 8, 2025

@yuye-aws Are you going to work on this feature?

@yuye-aws
Member

yuye-aws commented Feb 8, 2025

> @yuye-aws Are you going to work on this feature?

I'll create an RFC, but to be honest I don't have enough bandwidth to implement it. Don't worry, I can still watch these issues.

@cxclark

cxclark commented Feb 11, 2025

This would be an extremely useful feature, especially for evaluating relevance in extremely long documents with multiple vectors stored in a nested field. It would be great to be able to return the text chunk corresponding to the matching vector in `inner_hits`.

@yuye-aws
Member

Hi @Arukris! I think your proposal is one of the possible solutions to the actual need of neural-search users, which is to retrieve the specific chunk from the nested document. On top of that, I have created a feature request: #1188. There may be other solutions; we can discuss the pros and cons later.

Projects
Status: Backlog(Hot)
Development

No branches or pull requests

4 participants