[ML] Add ML limitation for ingesting large documents #2877

Merged Nov 27, 2024 (4 commits merged; showing changes from 3 commits)
12 changes: 11 additions & 1 deletion docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -9,6 +9,16 @@
The following limitations and known problems apply to the {version} release of
the Elastic {nlp} trained models feature.

[discrete]
[[ml-nlp-large-documents-limit-10k-10mb]]
== Semantic text fields are limited to 10k chunks, limiting ingested document size to under ~10MB

When using semantic text to ingest documents, chunking takes place automatically. The number
of chunks is limited by the mapping limit setting `index.mapping.nested_objects.limit`
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html
which defaults to 10k. As a result, a document that produces more chunks than this limit causes
an error during ingestion. To avoid this issue, split your documents into parts of roughly 1MB before ingestion.
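
The splitting step described above could be sketched as follows. This is a minimal illustration in Python, not part of the Elastic tooling: the function name `split_into_parts` and the 1,000,000-byte default are assumptions chosen to match the "roughly 1MB" guidance. It cuts on whitespace boundaries where possible and measures size in UTF-8 bytes.

```python
import re


def split_into_parts(text: str, max_bytes: int = 1_000_000) -> list[str]:
    """Split text into parts whose UTF-8 size is at most max_bytes.

    Hypothetical helper: the name and the default budget are assumptions.
    """
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")

    parts: list[str] = []
    current: list[str] = []
    current_size = 0

    # Tokens are alternating runs of non-whitespace and whitespace, so
    # joining all parts reproduces the input exactly.
    for token in re.findall(r"\S+|\s+", text):
        pieces = [token]
        if len(token.encode("utf-8")) > max_bytes:
            # A single run larger than the budget is cut per character.
            pieces = list(token)
        for piece in pieces:
            size = len(piece.encode("utf-8"))
            if current and current_size + size > max_bytes:
                parts.append("".join(current))
                current, current_size = [], 0
            current.append(piece)
            current_size += size

    if current:
        parts.append("".join(current))
    return parts
```

Each returned part can then be ingested as its own document. Whether parts should also carry a shared identifier linking them back to the original document depends on the application and is left out of this sketch.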

[discrete]
[[ml-nlp-elser-v1-limit-512]]
== ELSER semantic search is limited to 512 tokens per field that inference is applied to
@@ -17,4 +27,4 @@ When you use ELSER for semantic search, only the first 512 extracted tokens from
each field of the ingested documents that ELSER is applied to are taken into
account for the search process. If your data set contains long documents, divide
them into smaller segments before ingestion if you need the full text to be
searchable.
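
Dividing long documents into smaller segments could look roughly like the sketch below. It is only an approximation: it counts whitespace-separated words, whereas the 512-token limit applies to tokens produced by the model's own tokenizer, which typically yields more tokens than words. The function name `segment_by_words` and the 300-word default are assumptions chosen to leave headroom, not values from the Elastic documentation.

```python
def segment_by_words(text: str, max_words: int = 300) -> list[str]:
    """Split text into segments of at most max_words whitespace-separated words.

    Hypothetical helper: word count is only a rough proxy for model tokens.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in fixed-size windows.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

For example, `segment_by_words("a b c d e", max_words=2)` returns `["a b", "c d", "e"]`. Each segment would then be ingested into its own field value or document so that the full text remains searchable.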