[BUG] OS updates wiping knn_vector field when excluded from _source #1694

Closed
claire-chiu-figma opened this issue May 8, 2024 · 11 comments

Comments

@claire-chiu-figma

claire-chiu-figma commented May 8, 2024

What is the bug?
I have an index with a knn_vector field that I excluded from _source. When I update a document in this index without specifying any value for the knn_vector field, the field gets wiped from the document (when I would expect the field to remain unchanged).

How can one reproduce the bug?

  1. Go to OS dashboards.
  2. Create an OS index with knn vector field:
PUT /test
{
  "settings": {
    "index": {
      "replication": {
        "type": "DOCUMENT"
      },
      "knn": "true"
    }
  },
  "mappings": {
    "dynamic": "strict",
    "_source": {
      "excludes": [
        "embedding"
      ]
    },
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "creator_id": {
        "type": "keyword"
      },
      "file_id": {
        "type": "long"
      }
    }
  }
}
  3. Create a document in the index.
PUT /test/_doc/1
{
  "creator_id": "2",
  "file_id": 22,
  "embedding":[0.1]
}
  4. Check whether the knn vector field (embedding) exists on the document (the command below should return the single document).
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "embedding"
          }
        }
      ]
    }
  }
}
  5. Update the document, changing only a field other than the knn vector field.
POST test/_update/1
{
  "doc": {
    "file_id": 24
  }
}
  6. Check whether the document still has the knn_vector field by running the same command as in step 4 (it now returns no documents).

What is the expected behavior?
I would expect the knn_vector field to still exist after the update, because I have not made any changes to that field.

What is your host/environment?
OpenSearch version 2.11, hosted on AWS

Do you have any additional context?
In an index where the knn_vector field (embedding) is NOT excluded from _source, this problem is not present.

@claire-chiu-figma added the bug and untriaged labels May 8, 2024
@navneet1v
Collaborator

navneet1v commented May 8, 2024

@claire-chiu-figma because the vector is not stored in _source, updates will remove it. This is not a bug; it is what happens when a field is removed from _source.

Your other experiment shows the same thing: "In an index where the knn_vector field (embedding) is NOT excluded from _source, this problem is not present."

When the field is kept in _source, it is not removed.

@navneet1v removed the bug and untriaged labels May 8, 2024
@navneet1v
Collaborator

We are working on a related feature (ref: #1571) under which the vector field should not be removed when you exclude it from _source and then run updates.

@luyuncheng, as the author of the PR, can you validate that excluding the field from _source and then doing updates will work once your code is merged?

@claire-chiu-figma
Author

claire-chiu-figma commented May 8, 2024

@navneet1v Thanks for the quick response -

This is not a bug; it is what happens when a field is removed from _source.

To further understand the implications of removing a field from _source - does this mean that for ANY field that is excluded from _source, when you run an update on the document, if the update does not specify a new value for that field, that field will get wiped? Why is that so?

@navneet1v
Collaborator

if the update does not specify a new value for that field, that field will get wiped? Why is that so?

The reason is that OpenSearch has no way to recreate the whole document from scratch. The whole document is stored in _source, so if you remove something from it, the update capability goes away.
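
To make the mechanism concrete, here is a toy model in plain Python. It is only a sketch, not OpenSearch code: it assumes a partial update works by reading the stored _source, merging the partial doc into it, and reindexing the merged result. The names `index_document` and `partial_update` are hypothetical and exist only for this illustration.

```python
from copy import deepcopy

# Mirrors "_source.excludes" from the mapping in the issue description.
EXCLUDES = {"embedding"}


def index_document(doc):
    """Toy indexing step: every field is indexed, but only the
    non-excluded fields are kept in the stored _source."""
    indexed_fields = deepcopy(doc)
    stored_source = {k: v for k, v in doc.items() if k not in EXCLUDES}
    return indexed_fields, stored_source


def partial_update(stored_source, partial_doc):
    """Toy update step: merge the partial doc into the stored _source,
    then reindex the merged document as if it were brand new."""
    merged = {**stored_source, **partial_doc}
    return index_document(merged)


# Initial indexing: the vector is indexed but absent from _source.
fields_v1, source_v1 = index_document(
    {"creator_id": "2", "file_id": 22, "embedding": [0.1]}
)

# Update that does not mention the vector: the reindexed document has no vector.
fields_v2, source_v2 = partial_update(source_v1, {"file_id": 24})

# Update that supplies the vector: it is indexed again.
fields_v3, source_v3 = partial_update(source_v2, {"embedding": [0.5]})

print("embedding" in fields_v1, "embedding" in fields_v2, "embedding" in fields_v3)
```

In this model the vector disappears on the second step simply because the merged document is built from _source alone, and _source never contained it.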

@claire-chiu-figma
Author

I see, and #1571 would help resolve this issue even if the vector is excluded from _source, because it would store the vector in docvalue_fields, which can be pulled from during an update operation?

@navneet1v
Collaborator

I see, and #1571 would help resolve this issue even if the vector is excluded from _source, because it would store the vector in docvalue_fields, which can be pulled from during an update operation?

Vectors are already stored in doc values. The PR above will ensure that vectors are pulled from doc values when they are not present in _source. I would like @luyuncheng to comment further, as he is the author of the PR.

@claire-chiu-figma
Author

@navneet1v apart from the PR that's being worked on, are there any other approaches to issuing partial updates without wiping the vector? Or is adding this back to _source the only option?

@navneet1v
Collaborator

@claire-chiu-figma the only way is to recreate the whole source yourself, vector included, and send it in your update request. Otherwise you have to enable the source.
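
A minimal sketch of that workaround on the client side, in Python. It assumes the application still holds the vector somewhere outside OpenSearch (for example, the store the embedding was originally computed from); `build_full_document` is a hypothetical helper, not part of any OpenSearch client. The resulting body would be sent as a full index request for the same document ID rather than as a partial update.

```python
import json


def build_full_document(current_fields, embedding, changes):
    """Rebuild the complete document: known non-vector fields,
    the requested changes, and the vector fetched from the client's own store."""
    body = {**current_fields, **changes}
    body["embedding"] = list(embedding)
    return body


# Values taken from the reproduction steps in the issue description.
body = build_full_document(
    current_fields={"creator_id": "2", "file_id": 22},
    embedding=[0.1],
    changes={"file_id": 24},
)

# The full document replaces the old one, so the vector is indexed again.
print("PUT /test/_doc/1")
print(json.dumps(body, indent=2))
```

The cost of this approach is that every update must carry the full vector, so it only helps when the client can always retrieve the embedding from its own storage.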

@navneet1v
Collaborator

@claire-chiu-figma can I go ahead and close this issue? There is no bug, and this is the expected behavior of OpenSearch.

@luyuncheng
Collaborator

@navneet1v @claire-chiu-figma

When the field is excluded from _source and an update operation is performed, execution goes through this logic:

https://github.com/opensearch-project/OpenSearch/blob/14f1c43c108f378b13d109ade364216c082fb858/server/src/main/java/org/opensearch/index/engine/InternalEngine.java#L1311-L1318

It uses the source stored in Lucene to perform the update. As far as I know, the original reference documentation warns that when the source is excluded, the update, update_by_query, and reindex APIs cannot be used.

If we want to use the #1571 feature, which rewrites the FetchSubPhase, it can support reindex but not updates to other fields.

There are 2 scenarios:

  1. exclude vector, update vector field: OK
  2. exclude vector, update other field: Failed

@jmazanec15
Member

Covered in #1572
