[BUG] error on complex types list type field [category] has empty string, cannot process it #678

Closed
toyaokeke opened this issue Apr 9, 2024 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@toyaokeke

toyaokeke commented Apr 9, 2024

Initial bug reported in opensearch-project/ml-commons#2303

What is the bug?
I am creating a text embedding processor that creates vectors on a nested field. However, I receive an illegal_argument_exception because not every field in the object is one of the supported types:

  • string
  • map
  • string list

Here is the explanation from the AWS support specialist

Our internal team informed me that this exception happened when the “id” under “brand” field has int value that is not supported by the text embedding processor from ingestion pipeline, and the fields inside the complex type must be of types: string, map or list.

However, I am not creating vectors on id so I don't understand why it must follow these requirements. Is this expected behaviour or is this a bug?

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create the ingest pipeline:
PUT /_ingest/pipeline/neural-search-pipeline-v2
{
  "description": "An example neural search pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "WeliNowB6EaQJ_XFf05V",
        "field_map": {
          "category": {
            "name": {
              "en": "category_name_vector"
            }
          }
        }
      }
    }
  ]
}
  2. Simulate the ingest pipeline:
POST _ingest/pipeline/neural-search-pipeline-v2/_simulate
{
  "docs": [
    {
      "_index": "neural-search-index-v2",
      "_id": "1",
      "_source": {
        "category": {
          "id": 1,
          "name": {
            "en": "category 1"
          }
        }
      }
    }
  ]
}

What is the expected behavior?
The pipeline should create a vector for the category name:

{
    "docs": [
      {
        "doc": {
          "_index": "neural-search-index-v2",
          "_id": "1",
          "_source": {
            "category": {
              "name": {
                "category_name_vector": [
                  0.019107267,
                  -0.029297447,
                  0.0070927013,
                  -0.022105217,
                  ...
                ],
                "en": "category 1"
              },
              "id": 1
            }
          },
          "_ingest": {
            "timestamp": "2024-01-08T17:59:39.543401762Z"
          }
        }
      }
    ]
  }

What is your host/environment?

  • OS: AWS OpenSearch Service managed cluster
  • Version: 2.11

Do you have any screenshots?

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "list type field [category] has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}

Do you have any additional context?

invalid doc

{
   "brand": {
      "id": 123, // cannot be integer
      "description": {
         "en": "en description female",
         "fr": "" // cannot be empty string
      }
      ...
   },
   "category": {
      "id": "123", // valid string
      "sizes": [
         "XS",
         "XL",
         "", // elements in list cannot be empty strings
         123 // elements in list cannot be integers
         ...
      ]
   }
}

valid doc

{
   "brand": {
      "id": "123",
      "description": {
         "en": "en description"
      }
      ...
   },
   "category": {
      "id": "123",
      "sizes": [ ], // empty list is valid
      "description": {
         // empty object is valid
      }
   }
}
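
To make the rules in these two examples concrete, here is a minimal Python sketch of the validation as I observed it. It is not the plugin's actual code: the function name `is_embeddable` is hypothetical, and treating every non-string, non-map, non-list value as invalid is my assumption based on the integer case reported above.

```python
def is_embeddable(value):
    """Return True if a value passes the rules observed in this issue.

    Assumed rules (inferred from the examples above, not from plugin source):
      - a string must not be empty
      - a map is valid when all of its values are valid (an empty map is valid)
      - a list is valid when every element is a non-empty string (an empty list is valid)
      - anything else, such as an integer, is invalid
    """
    if isinstance(value, str):
        return value != ""
    if isinstance(value, dict):
        return all(is_embeddable(child) for child in value.values())
    if isinstance(value, list):
        return all(isinstance(item, str) and item != "" for item in value)
    return False


# The document from the reproduction steps fails only because of the integer id,
# even though the embedding is configured on category.name.en.
category = {"id": 1, "name": {"en": "category 1"}}
print(is_embeddable(category))          # False
print(is_embeddable(category["name"]))  # True
```

The point of the sketch is that the check walks the whole object, so a sibling field such as `id` can reject a document whose configured field is fine.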
@navneet1v
Collaborator

@zane-neo can you look into this issue?

@zane-neo
Collaborator

zane-neo commented Apr 10, 2024

@toyaokeke From your example, I see two different cases:

  1. category is a map type instead of list:
{
  "category": {
    "id": 1,
    "name": {
      "en": "category 1"
    }
  }
}
  2. category doesn't have name.en:
"category": {
      "id": "123",
      "sizes": [ ] // empty list is valid
   }

Can you confirm which case is your real production case?

@zane-neo
Collaborator

I did find an issue in the code, but the error message seems to differ from yours. I created this ticket to track it: opensearch-project/ml-commons#2309. I'm still trying to understand your case to see whether there are other issues here.

@toyaokeke
Author

Hi @zane-neo and thank you for looking into this.

The invalid doc and valid doc examples in my description are all possible scenarios.

  • Empty strings in an array do not occur often, but they are possible
  • A missing name.zh is also possible: in my production environment, not every document has translations for all supported languages

@toyaokeke
Author

toyaokeke commented Apr 10, 2024

Also, I see the error from the ticket you created when I run the simulator:

opensearch-project/ml-commons#2309

  • when category.id is a number, the error you mentioned occurs
  • when category.description.en, for example, is an empty string, I get the error in this ticket, which is unexpected since the text embedding is on category.name, not category.description

The main issue in both cases is that I am creating embeddings on category.name.en, which does exist, yet other fields that I am not creating embeddings on are causing these errors.

@zane-neo
Collaborator

zane-neo commented Apr 11, 2024

@toyaokeke, I'd like to double-check the category data structure: it's a map type rather than a list, right?

{
  "category": {
    "name": {
      ...
    }
  }
}

NOT

{
  "category": [
    {
      "name": {
        ...
      }
    }
  ]
}

I ask because a list type would need a much more complex fix, while a map type would be easier to fix. I created this issue to track the list-of-maps case: #686

@toyaokeke
Author

Correct, it is a map type.

@mingshl

mingshl commented May 7, 2024

@toyaokeke the new ml_inference processor supports nested object types. Check out this unit test.

You can try the ml_inference processor in version 2.14 and see if that solves your issue: opensearch-project/documentation-website#7095

I am still working on the tutorial for using ml_inference processors with neural search queries and will post here once I have it.

@toyaokeke
Author

Hi @mingshl, thank you for directing me to this! If I understand correctly, the ml_inference processor supports nested types but is only available in 2.14?

I am using AWS Managed Service, and it currently only supports up to 2.11. I would be more than happy to test that processor once AWS releases support for that version 🙏🏿

@toyaokeke
Author

@mingshl considering what you shared, is this a bug that will still be fixed for the text_embedding processor, or are users encouraged to switch to the ml_inference processor for nested fields?

@toyaokeke
Author

Hello @zane-neo, I saw this PR was recently merged. I just wanted to confirm whether a fix has been merged in that case, and whether you know which version it will be released in.

@toyaokeke
Author

Hi @zane-neo, just checking in to see if this bug has been resolved and can be closed?

@naveentatikonda
Member

@zane-neo can you please validate and confirm whether the bug has been fixed? Thanks!

@toyaokeke
Author

toyaokeke commented Sep 24, 2024

As part of the fix, I was also wondering whether more detail could be provided in the error message, such as which field within a nested attribute is causing the error.

For example, given this failure:

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "list type field [category] has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}

I do not know which field within [category] (e.g. name, description) is causing the error unless I look through the document myself. It would be great if the error identified the field, for example:

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "[name] field within [category] within ... [rootEntity] entity has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}
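
To illustrate the kind of message I mean, here is a minimal Python sketch that tracks the path while walking a nested value. Everything in it is hypothetical: `find_problems`, `format_reason`, and the dotted-path wording are my own illustration, not the plugin's code or its real message format.

```python
def find_problems(value, path):
    """Collect (path, problem) pairs for every value that breaks the observed rules."""
    problems = []
    if isinstance(value, str):
        if value == "":
            problems.append((path, "has empty string"))
    elif isinstance(value, dict):
        # Extend the path with each key so the offending leaf can be named.
        for key, child in value.items():
            problems.extend(find_problems(child, f"{path}.{key}"))
    elif isinstance(value, list):
        # Include the index so a bad list element is easy to locate.
        for index, item in enumerate(value):
            problems.extend(find_problems(item, f"{path}[{index}]"))
    else:
        problems.append((path, f"has unsupported type {type(value).__name__}"))
    return problems


def format_reason(path, problem):
    """Build a hypothetical, path-aware version of the error reason."""
    return f"field [{path}] {problem}, cannot process it"


category = {"id": "123", "name": {"en": "category 1"}, "description": {"en": ""}}
for path, problem in find_problems(category, "category"):
    print(format_reason(path, problem))
# prints: field [category.description.en] has empty string, cannot process it
```

With a path like this in the reason, I would not have to search the document to find the offending field.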

@jmazanec15
Member

cc: @model-collapse

@zane-neo
Collaborator

zane-neo commented Oct 9, 2024

@toyaokeke Sorry, I missed updating this issue. This is already fixed in PR #687. The root cause was that, when validating a map-type field, fields not listed in the configuration were also validated. The fix removed the validation on those non-embedding fields, so you should no longer see this error and don't need to worry about which field name caused it.
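
As a rough illustration of the idea behind the fix (a Python sketch under stated assumptions, not the code from the PR), validation can be driven by the field_map so that only configured fields are checked. The names `validate_configured_fields` and `check_text` are hypothetical, and skipping a missing source field is an assumption.

```python
def check_text(value, path):
    """Require a non-empty string or a list of non-empty strings at an embedding field."""
    items = value if isinstance(value, list) else [value]
    for item in items:
        if not isinstance(item, str) or item == "":
            raise ValueError(
                f"[{path}] must be a non-empty string or a list of non-empty strings"
            )


def validate_configured_fields(source, field_map, path=""):
    """Validate only the values that field_map points at; ignore all other fields."""
    for key, target in field_map.items():
        full_path = f"{path}.{key}" if path else key
        if key not in source:
            continue  # assumption: a document without the configured field is skipped
        value = source[key]
        if isinstance(target, dict):
            # field_map nests further, so the source value must be a map as well
            if not isinstance(value, dict):
                raise ValueError(f"[{full_path}] is expected to be a map")
            validate_configured_fields(value, target, full_path)
        else:
            # leaf: target is the vector field name, value is the text to embed
            check_text(value, full_path)


field_map = {"category": {"name": {"en": "category_name_vector"}}}
doc = {"category": {"id": 1, "description": {"en": ""}, "name": {"en": "category 1"}}}
validate_configured_fields(doc, field_map)  # passes: id and description are never inspected
```

Because the walk follows field_map rather than the document, an integer `id` or an empty `description.en` no longer affects a pipeline that only embeds `category.name.en`.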

@zane-neo zane-neo closed this as completed Oct 9, 2024