components llm_rag_crack_chunk_embed_index_and_register
Creates chunks no larger than chunk_size from input_data. Extracted document titles are prepended to each chunk.
LLMs have token limits for the prompts passed to them. This is a limiting factor at embedding time and even more so at prompt completion time, since only so much context can be passed along with the instructions and user queries. Chunking splits source data of various formats into small but coherent snippets of information that can be packed into LLM prompts when answering user queries about the source documents.
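
To make the chunking idea concrete, here is a minimal sketch. It assumes whitespace-separated words stand in for model tokens and that the prepended title counts against the chunk_size budget. The function name chunk_text is hypothetical, and the component's own splitter may behave differently.

```python
def chunk_text(text, chunk_size=768, chunk_overlap=0, title=None):
    """Split text into chunks of at most chunk_size whitespace tokens.

    Assumptions (not taken from the component): tokens are whitespace-separated
    words, and the title tokens count toward chunk_size.
    """
    tokens = text.split()
    title_tokens = title.split() if title else []
    budget = chunk_size - len(title_tokens)  # tokens left for body text
    if chunk_overlap < 0 or budget <= chunk_overlap:
        raise ValueError("chunk_size must exceed title length plus chunk_overlap")

    step = budget - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + budget]
        chunks.append(" ".join(title_tokens + window))
        if start + budget >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

With chunk_overlap greater than zero, the last few tokens of each chunk are repeated at the start of the next one, which helps keep sentences that straddle a boundary intact in at least one chunk.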
Supported formats: md, txt, html/htm, pdf, ppt(x), doc(x), xls(x), py
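
As an illustration of how a glob pattern and the supported formats might combine, the sketch below selects files from an input folder. The helper select_files and the SUPPORTED_SUFFIXES set are hypothetical names, and skipping unsupported files is an assumption rather than documented behavior of the component.

```python
from pathlib import Path

# Suffixes derived from the supported formats list above.
SUPPORTED_SUFFIXES = {
    ".md", ".txt", ".html", ".htm", ".pdf",
    ".ppt", ".pptx", ".doc", ".docx", ".xls", ".xlsx", ".py",
}


def select_files(input_dir, input_glob="**/*"):
    """Return files under input_dir that match input_glob and have a supported suffix."""
    root = Path(input_dir)
    return sorted(
        path
        for path in root.glob(input_glob)
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Passing a narrower pattern such as '**/*.md' would restrict the run to Markdown files only.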
Also generates embedding vectors for data chunks if configured.
If embeddings_container is supplied, input chunks are compared to the existing chunks in the Embeddings Container and only changed or new chunks are embedded; existing chunks are reused.
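
The reuse behavior can be sketched as follows. This is only an illustration: it assumes chunks are compared by a hash of their content, and the names content_key, embed_incrementally, and embed_fn are hypothetical. How the component actually detects changed chunks is not stated here.

```python
import hashlib


def content_key(chunk):
    """Stable key for a chunk, derived from its text (assumed comparison method)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def embed_incrementally(chunks, existing, embed_fn):
    """Embed only chunks whose key is missing from `existing`.

    chunks:   list of chunk strings
    existing: dict mapping content key -> previously computed vector
    embed_fn: callable taking a list of strings and returning a list of vectors
    Returns (key -> vector for every input chunk, number of newly embedded chunks).
    """
    keys = [content_key(chunk) for chunk in chunks]
    # Collect new or changed chunks once each, preserving order.
    new_chunks = {k: c for k, c in zip(keys, chunks) if k not in existing}
    fresh = {}
    if new_chunks:
        vectors = embed_fn(list(new_chunks.values()))
        fresh = dict(zip(new_chunks.keys(), vectors))
    merged = {k: existing[k] if k in existing else fresh[k] for k in keys}
    return merged, len(fresh)
```

On a rerun where only a few source files changed, embed_fn is called for those chunks alone, which is where the time and cost savings come from.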
After indexing completes, an MLIndex YAML file and supporting files are registered as an AzureML data asset.
Version: 0.0.31
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_crack_chunk_embed_index_and_register/version/0.0.31
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
input_data | | uri_folder | | False | |
embeddings_container | Folder containing previously generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data is compared to existing embeddings and only changed or new data is embedded; existing chunks are reused. | uri_folder | | True | |
asset_uri | Where to save the MLIndex. | uri_folder | | True | |
input_glob | Limits the files opened from input_data; defaults to '**/*'. | string | '**/*' | False | |
chunk_size | Maximum number of tokens per chunk. | integer | 768 | False | |
chunk_overlap | Number of tokens to overlap between chunks. | integer | 0 | False | |
use_rcts | Use the langchain RecursiveCharacterTextSplitter to split chunks. | boolean | True | False | |
citation_url | Base URL to join with file paths to create the full source file URL for chunk metadata. | string | | True | |
citation_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source URL. For example, '{"match_pattern": "(.*)/articles/(.*)", "replacement_pattern": "\\1/\\2"}' would remove '/articles' from the middle of the URL. | string | | True | |
doc_intel_connection_id | AzureML Connection ID for a Custom Workspace Connection containing the endpoint key and api_key secret for an Azure AI Document Intelligence service. | string | | True | |
embeddings_model | The model to use to embed data, e.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}'. | string | | True | |
embeddings_connection_id | The connection ID of the embeddings model provider to use. | string | | False | |
batch_size | Batch size to use when embedding data. | integer | 100 | False | |
num_workers | Number of workers to use when embedding data. | integer | -1 | False | |
asset_name | Name of the asset to register. | string | | False | |
acs_config | JSON string containing the ACS configuration, e.g. {"index_name": "my-index"}. | string | | False | |
index_connection_id | The connection ID of the ACS provider to use. | string | | True | |
validate_deployments | Enables validation of model and index deployments. | string | | True | |
llm_config | JSON string containing the LLM configuration. | string | | True | |
llm_connection_id | The connection ID of the LLM provider to use. | string | | True | |
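
The interaction between citation_url and citation_replacement_regex can be sketched as follows. This is a hedged illustration, not the component's code: the helper build_citation_url is hypothetical, the plain string join is an assumption about how the base URL and file path are combined, and the pattern groups are written as (.*), which is an assumption about the intended example. The only library calls used are json.loads and re.sub.

```python
import json
import re

# Raw string so the JSON text carries the escaped backslashes "\\1/\\2".
REGEX_JSON = r'{"match_pattern": "(.*)/articles/(.*)", "replacement_pattern": "\\1/\\2"}'


def build_citation_url(citation_url, file_path, replacement_regex_json=None):
    """Join a base URL with a relative file path, then optionally rewrite it."""
    url = citation_url.rstrip("/") + "/" + file_path.lstrip("/")
    if replacement_regex_json:
        config = json.loads(replacement_regex_json)
        url = re.sub(config["match_pattern"], config["replacement_pattern"], url)
    return url


if __name__ == "__main__":
    print(build_citation_url("https://example.com/docs/", "articles/intro.md", REGEX_JSON))
```

Because the value is passed as a JSON string, backreferences such as \1 must be written with doubled backslashes inside the JSON text.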
azureml:llm-rag-embeddings@latest