diff --git a/AI_Agents_Guide/Constrained_Decoding/README.md b/AI_Agents_Guide/Constrained_Decoding/README.md
new file mode 100644
index 00000000..28e07417
--- /dev/null
+++ b/AI_Agents_Guide/Constrained_Decoding/README.md
@@ -0,0 +1,703 @@
+
+
+# Constrained Decoding with Triton Inference Server
+
+This tutorial focuses on constrained decoding, an important technique for
+ensuring that large language models (LLMs) generate outputs that adhere
+to strict formatting requirements, which may be challenging or
+expensive to achieve solely through fine-tuning.
+
+## Table of Contents
+
+- [Introduction to Constrained Decoding](#introduction-to-constrained-decoding)
+- [Prerequisite: Hermes-2-Pro-Llama-3-8B](#prerequisite-hermes-2-pro-llama-3-8b)
+- [Structured Generation via Prompt Engineering](#structured-generation-via-prompt-engineering)
+  * [Example 1](#example-1)
+  * [Example 2](#example-2)
+- [Enforcing Output Format via External Libraries](#enforcing-output-format-via-external-libraries)
+  * [Pre-requisite: Common set-up](#pre-requisite-common-set-up)
+    + [Logits Post-Processor](#logits-post-processor)
+    + [Tokenizer](#tokenizer)
+    + [Repository set up](#repository-set-up)
+  * [LM Format Enforcer](#lm-format-enforcer)
+  * [Outlines](#outlines)
+
+## Introduction to Constrained Decoding
+
+Constrained decoding is a powerful technique used in natural language processing
+and various AI applications to guide and control the output of a model.
+By imposing specific constraints, this method ensures that generated outputs
+adhere to predefined criteria, such as length, format, or content restrictions.
+This capability is essential in contexts where compliance with rules
+is non-negotiable, such as producing valid code snippets, structured data,
+or grammatically correct sentences.
+
+In recent advancements, some models are already fine-tuned to incorporate
+these constraints inherently. These models are designed
+to seamlessly integrate constraints during the generation process, reducing
+the need for extensive post-processing. By doing so, they enhance the efficiency
+and accuracy of tasks that require strict adherence to predefined rules.
+This built-in capability makes them particularly valuable in applications
+like automated content creation, data validation, and real-time language
+translation, where precision and reliability are paramount.
+
+This tutorial is based on [Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B),
+which already supports JSON Structured Outputs. An extensive set of instructions
+for deploying the Hermes-2-Pro-Llama-3-8B model with Triton Inference Server and
+the TensorRT-LLM backend can be found in [this](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md)
+tutorial. The structure and quality of the produced output in such cases can be
+controlled through prompt engineering. To explore this path, please refer to the
+[Structured Generation via Prompt Engineering](#structured-generation-via-prompt-engineering)
+section of this tutorial.
+
+For scenarios where models are not inherently fine-tuned for
+constrained decoding, or when more precise control over the output is desired,
+dedicated libraries like
+[*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file)
+offer robust solutions. 
These libraries provide tools to enforce specific
+constraints on model outputs, allowing developers to tailor the generation
+process to meet precise requirements. By leveraging such libraries,
+users can achieve greater control over the output, ensuring it aligns perfectly
+with the desired criteria, whether that involves maintaining a certain format,
+adhering to content guidelines, or ensuring grammatical correctness.
+In this tutorial, we'll show how to use *LM Format Enforcer* and *Outlines*
+in your workflow.
+
+## Prerequisite: Hermes-2-Pro-Llama-3-8B
+
+Before proceeding, please make sure that you've successfully deployed the
+[Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B)
+model with Triton Inference Server and the TensorRT-LLM backend
+following [these steps](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md).
+
+## Structured Generation via Prompt Engineering
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials:/tutorials \
+    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
+    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship
+with the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+### Example 1
+
+For a fine-tuned model, we can enable JSON mode by simply composing a system
+prompt as:
+
+```
+You are a helpful assistant that answers in JSON.
+```
+Please refer to [`client.py`](./artifacts/client.py) for the full `prompt`
+composition logic.
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt
+```
+You should expect the following response:
+
+```
+...
+assistant
+{
+  "title": "Harry Potter and the Order of Phoenix",
+  "book_number": 5,
+  "author": "J.K. Rowling",
+  "series": "Harry Potter",
+  "publication_date": "June 21, 2003",
+  "page_count": 766,
+  "publisher": "Arthur A. Levine Books",
+  "genre": [
+    "Fantasy",
+    "Adventure",
+    "Young Adult"
+  ],
+  "awards": [
+    {
+      "award_name": "British Book Award",
+      "category": "Children's Book of the Year",
+      "year": 2004
+    }
+  ],
+  "plot_summary": "Harry Potter and the Order of Phoenix is the fifth book in the Harry Potter series. In this installment, Harry returns to Hogwarts School of Witchcraft and Wizardry for his fifth year. The Ministry of Magic is in denial about the return of Lord Voldemort, and Harry finds himself battling against the
+
+```
+
+### Example 2
+
+Optionally, we can also restrict the output to a specific schema. For example,
+in [`client.py`](./artifacts/client.py) we use the `pydantic` library to define
+the following answer format:
+
+```python
+from pydantic import BaseModel
+
+class AnswerFormat(BaseModel):
+    title: str
+    year: int
+    director: str
+    producer: str
+    plot: str
+
+...
+
+prompt += "Here's the json schema you must adhere to:\n\n{schema}\n".format(
+    schema=AnswerFormat.model_json_schema())
+
+```
+Let's try it out:
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt --use-schema
+```
+You should expect the following response:
+
+```
+ ...
+assistant
+{
+  "title": "Harry Potter and the Order of Phoenix",
+  "year": 2007,
+  "director": "David Yates",
+  "producer": "David Heyman",
+  "plot": "Harry Potter and his friends must protect Hogwarts from a threat when the Ministry of Magic is taken over by Lord Voldemort's followers."
+}
+
+```
+
+## Enforcing Output Format via External Libraries
+
+In this section of the tutorial, we'll show how to impose constraints on LLMs
+that are not inherently fine-tuned for constrained decoding. We'll use
+[*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file),
+both of which offer robust solutions for this purpose.
+
+The reference implementation for both libraries is provided in the
+[`utils.py`](./artifacts/utils.py) script, which also defines the output
+format `AnswerFormat`:
+
+```python
+class WandFormat(BaseModel):
+    wood: str
+    core: str
+    length: float
+
+class AnswerFormat(BaseModel):
+    name: str
+    house: str
+    blood_status: str
+    occupation: str
+    alive: str
+    wand: WandFormat
+```
+
+### Pre-requisite: Common set-up
+
+Make sure you've successfully deployed the Hermes-2-Pro-Llama-3-8B model
+with Triton Inference Server and the TensorRT-LLM backend following
+[these steps](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md).
+> [!IMPORTANT]
+> Make sure that the `tutorials` folder is mounted to `/tutorials` when you
+> start the docker container.
+
+
+Upon successful setup you should have the `/opt/tritonserver/inflight_batcher_llm`
+folder and be able to run a couple of inference requests (e.g. those provided in
+[example 1](#example-1) or [example 2](#example-2)).
+
+We'll make some adjustments to the model files, thus if you have a running
+server, you can stop it via:
+```bash
+pkill tritonserver
+```
+
+#### Logits Post-Processor
+
+Both of the libraries limit the set of allowed tokens at every generation stage.
+In TensorRT-LLM, a user can define a custom
+[logits post-processor](https://nvidia.github.io/TensorRT-LLM/advanced/batch-manager.html#logits-post-processor-optional)
+to mask logits that should never be used in the current generation step.
+
+For TensorRT-LLM models deployed via the `python` backend (i.e. when
+[`triton_backend`](https://github.com/triton-inference-server/tensorrtllm_backend/blob/8aaf89bcf723dad112839fd36cbbe09e2e439c63/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L28C10-L28C29)
+is set to `python` in `tensorrt_llm/config.pbtxt`, Triton's python backend will
+use
+[`model.py`](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py)
+to serve your TensorRT-LLM model), the custom logits processor should be
+specified during the model's initialization as part of the
+[Executor's](https://nvidia.github.io/TensorRT-LLM/executor.html#executor-api)
+configuration
+([`logits_post_processor_map`](https://github.com/NVIDIA/TensorRT-LLM/blob/32ed92e4491baf2d54682a21d247e1948cca996e/tensorrt_llm/hlapi/llm_utils.py#L205)).
+Below is a sample for reference.
+
+```diff
+...
+
++ executor_config.logits_post_processor_map = {
++     "<custom_logits_processor_name>": custom_logits_processor
++ }
+self.executor = trtllm.Executor(model_path=...,
+                                model_type=...,
+                                executor_config=executor_config)
+...
+```
+
+Additionally, if you want to enable the logits post-processor for each request
+individually, you can do so via an additional `input` parameter, as we will do
+below.
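+
+Before wiring this through the Triton model files, it may help to see the
+shape of such a processor. Below is a minimal, illustrative sketch; the class
+name and its fixed allow-list are placeholders of ours, while the reference
+implementations used later in this tutorial live in
+[`utils.py`](./artifacts/utils.py) and follow the same pattern.
+
+```python
+from typing import List
+
+import torch
+
+
+class AllowListLogitsProcessor:
+    """Toy processor: only token ids from `allowed_ids` can ever be generated."""
+
+    PROCESSOR_NAME = "allow_list"
+
+    def __init__(self, allowed_ids: List[int]):
+        self.allowed_ids = allowed_ids
+
+    def __call__(
+        self, req_id: int, logits: torch.Tensor, ids: List[List[int]], stream_ptr: int
+    ):
+        # Start from a mask that blocks every token ...
+        mask = torch.full_like(logits, fill_value=float("-inf"))
+        # ... and re-enable only the allowed ones.
+        mask[:, :, self.allowed_ids] = 0
+        # Apply the mask on the CUDA stream TensorRT-LLM generates on.
+        with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
+            logits += mask
+```
+
+Both *LM Format Enforcer* and *Outlines* plug into this same callable
+interface; they only differ in how the set of allowed tokens is computed at
+each step.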
+
+Specifically, in this tutorial we will add `logits_post_processor_name` to
+`inflight_batcher_llm/tensorrt_llm/config.pbtxt`:
+```diff
+input [
+    {
+        name: "input_ids"
+        data_type: TYPE_INT32
+        dims: [ -1 ]
+        allow_ragged_batch: true
+    },
+    ...
+    {
+        name: "lora_config"
+        data_type: TYPE_INT32
+        dims: [ -1, 3 ]
+        optional: true
+        allow_ragged_batch: true
+-   }
++   },
++   {
++       name: "logits_post_processor_name"
++       data_type: TYPE_STRING
++       dims: [ -1 ]
++       optional: true
++   }
+]
+...
+```
+and process it in the `execute` function in
+`inflight_batcher_llm/tensorrt_llm/1/model.py`:
+
+```diff
+def execute(self, requests):
+    """`execute` must be implemented in every Python model. `execute`
+    function receives a list of pb_utils.InferenceRequest as the only
+    argument. This function is called when an inference is requested
+    for this model.
+    Parameters
+    ----------
+    requests : list
+        A list of pb_utils.InferenceRequest
+    Returns
+    -------
+    list
+        A list of pb_utils.InferenceResponse. The length of this list must
+        be the same as `requests`
+    """
+    ...
+
+    for request in requests:
+        response_sender = request.get_response_sender()
+        if get_input_scalar_by_name(request, 'stop'):
+            self.handle_stop_request(request.request_id(), response_sender)
+        else:
+            try:
+                converted = convert_request(request,
+                                            self.exclude_input_from_output,
+                                            self.decoupled)
++               logits_post_processor_name = get_input_tensor_by_name(request, 'logits_post_processor_name')
++               if logits_post_processor_name is not None:
++                   converted.logits_post_processor_name = logits_post_processor_name.item().decode('utf-8')
+            except Exception as e:
+                ...
+```
+In this tutorial, we're deploying the Hermes-2-Pro-Llama-3-8B model as part of
+an ensemble. This means that a request is first processed by the `ensemble`
+model, and then it is sent to the `preprocessing` model, the `tensorrt_llm`
+model, and finally the `postprocessing` model. This sequence, together with the
+input and output mappings, is defined in
+`inflight_batcher_llm/ensemble/config.pbtxt`. Thus, we also need to update
+`inflight_batcher_llm/ensemble/config.pbtxt`, so that the `ensemble` model
+properly passes the additional input parameter to the `tensorrt_llm` model:
+
+```diff
+input [
+  {
+      name: "text_input"
+      data_type: TYPE_STRING
+      dims: [ -1 ]
+  },
+  ...
+  {
+      name: "embedding_bias_weights"
+      data_type: TYPE_FP32
+      dims: [ -1 ]
+      optional: true
+- }
++ },
++ {
++     name: "logits_post_processor_name"
++     data_type: TYPE_STRING
++     dims: [ -1 ]
++     optional: true
++ }
+]
+output [
+  ...
+]
+ensemble_scheduling {
+  step [
+    {
+      model_name: "preprocessing"
+      model_version: -1
+      ...
+    },
+    {
+      model_name: "tensorrt_llm"
+      model_version: -1
+      input_map {
+        key: "input_ids"
+        value: "_INPUT_ID"
+      }
+      ...
+      input_map {
+        key: "bad_words_list"
+        value: "_BAD_WORDS_IDS"
+      }
++     input_map {
++       key: "logits_post_processor_name"
++       value: "logits_post_processor_name"
++     }
+      output_map {
+        key: "output_ids"
+        value: "_TOKENS_BATCH"
+      }
+      ...
+    }
+    ...
+```
+
+If you follow along with this tutorial, make sure the same changes are
+incorporated into the corresponding files of the
+`/opt/tritonserver/inflight_batcher_llm` repository.
+
+#### Tokenizer
+
+Both [*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file)
+require tokenizer access at initialization time. 
In this tutorial,
+we'll expose the tokenizer via a `tokenizer_dir` parameter in
+`inflight_batcher_llm/tensorrt_llm/config.pbtxt`:
+
+```txt
+parameters: {
+  key: "tokenizer_dir"
+  value: {
+    string_value: "/Hermes-2-Pro-Llama-3-8B"
+  }
+}
+```
+Simply append it to the end of `inflight_batcher_llm/tensorrt_llm/config.pbtxt`.
+
+#### Repository set up
+
+We've provided a sample implementation for *LM Format Enforcer* and *Outlines*
+in [`artifacts/utils.py`](./artifacts/utils.py). Make sure you've copied it into
+`/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/lib` via
+
+```bash
+mkdir -p inflight_batcher_llm/tensorrt_llm/1/lib
+cp /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py inflight_batcher_llm/tensorrt_llm/1/lib/
+```
+Finally, let's install all required libraries:
+
+```bash
+pip install pydantic lm-format-enforcer outlines setuptools
+```
+
+### LM Format Enforcer
+
+To use LM Format Enforcer, make sure
+`inflight_batcher_llm/tensorrt_llm/1/model.py` contains the following changes:
+
+```diff
+...
+import tensorrt_llm.bindings.executor as trtllm
+
++ from lib.utils import LMFELogitsProcessor, AnswerFormat
+
+...
+
+class TritonPythonModel:
+    """Your Python model must use the same class name. Every Python model
+    that is created must have "TritonPythonModel" as the class name.
+    """
+    ...
+
+    def get_executor_config(self, model_config):
++       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
++       logits_lmfe_processor = LMFELogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
+        kwargs = {
+            "max_beam_width":
+            get_parameter(model_config, "max_beam_width", int),
+            "scheduler_config":
+            self.get_scheduler_config(model_config),
+            "kv_cache_config":
+            self.get_kv_cache_config(model_config),
+            "enable_chunked_context":
+            get_parameter(model_config, "enable_chunked_context", bool),
+            "normalize_log_probs":
+            get_parameter(model_config, "normalize_log_probs", bool),
+            "batching_type":
+            convert_batching_type(get_parameter(model_config,
+                                                "gpt_model_type")),
+            "parallel_config":
+            self.get_parallel_config(model_config),
+            "peft_cache_config":
+            self.get_peft_cache_config(model_config),
+            "decoding_config":
+            self.get_decoding_config(model_config),
++           "logits_post_processor_map": {
++               LMFELogitsProcessor.PROCESSOR_NAME: logits_lmfe_processor
++           }
+        }
+        kwargs = {k: v for k, v in kwargs.items() if v is not None}
+        return trtllm.ExecutorConfig(**kwargs)
+...
+```
+
+#### Send an inference request
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials/:/tutorials \
+    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship
+with the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+##### Option 1. Use the provided [client script](./artifacts/client.py)
+
+Let's first send a standard request, without enforcing the JSON answer format:
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100
+```
+
+You should expect the following response:
+
+```bash
+Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. 
The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and +``` + +Now, let's specify `logits_post_processor_name` in our request: + +```bash +python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "lmfe" +``` + +This time, the expected response looks like: +```bash +Who is Harry Potter? + { + "name": "Harry Potter", + "occupation": "Wizard", + "house": "Gryffindor", + "wand": { + "wood": "Holly", + "core": "Phoenix feather", + "length": 11 + }, + "blood_status": "Pure-blood", + "alive": "Yes" + } +``` +As we can see, the schema, defined in [`utils.py`](./artifacts/utils.py) is +respected. Note, LM Format Enforcer lets LLM to control the order of generated +fields, thus re-ordering of fields is allowed. + +##### Option 2. Use [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint). + +Let's first send a standard request, without enforcing the JSON answer format: +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' +``` + +You should expect the following response: + +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"} +``` + +Now, let's specify `logits_post_processor_name` in our request: + +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "lmfe"}' +``` + +This time, the expected response looks like: +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter? \t\t\t\n\t\t{\n\t\t\t\"name\": \"Harry Potter\",\n\t\t\t\"occupation\": \"Wizard\",\n\t\t\t\"house\": \"Gryffindor\",\n\t\t\t\"wand\": {\n\t\t\t\t\"wood\": \"Holly\",\n\t\t\t\t\"core\": \"Phoenix feather\",\n\t\t\t\t\"length\": 11\n\t\t\t},\n\t\t\t\"blood_status\": \"Pure-blood\",\n\t\t\t\"alive\": \"Yes\"\n\t\t}\n\n\t\t\n\n\n\n\t\t\n"} +``` + +### Outlines + +To use Outlines, make sure +`inflight_batcher_llm/tensorrt_llm/1/model.py` contains the following changes: + +```diff +... +import tensorrt_llm.bindings.executor as trtllm + ++ from lib.utils import OutlinesLogitsProcessor, AnswerFormat + +... + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + ... 
+
+    def get_executor_config(self, model_config):
++       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
++       logits_outlines_processor = OutlinesLogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
+        kwargs = {
+            "max_beam_width":
+            get_parameter(model_config, "max_beam_width", int),
+            "scheduler_config":
+            self.get_scheduler_config(model_config),
+            "kv_cache_config":
+            self.get_kv_cache_config(model_config),
+            "enable_chunked_context":
+            get_parameter(model_config, "enable_chunked_context", bool),
+            "normalize_log_probs":
+            get_parameter(model_config, "normalize_log_probs", bool),
+            "batching_type":
+            convert_batching_type(get_parameter(model_config,
+                                                "gpt_model_type")),
+            "parallel_config":
+            self.get_parallel_config(model_config),
+            "peft_cache_config":
+            self.get_peft_cache_config(model_config),
+            "decoding_config":
+            self.get_decoding_config(model_config),
++           "logits_post_processor_map": {
++               OutlinesLogitsProcessor.PROCESSOR_NAME: logits_outlines_processor
++           }
+        }
+        kwargs = {k: v for k, v in kwargs.items() if v is not None}
+        return trtllm.ExecutorConfig(**kwargs)
+...
+```
+
+#### Send an inference request
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials/:/tutorials \
+    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship
+with the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+##### Option 1. Use the provided [client script](./artifacts/client.py)
+
+Let's first send a standard request, without enforcing the JSON answer format:
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100
+```
+
+You should expect the following response:
+
+```bash
+Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and
+```
+
+Now, let's specify `logits_post_processor_name` in our request:
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "outlines"
+```
+
+This time, the expected response looks like:
+```bash
+Who is Harry Potter?{ "name": "Harry Potter","house": "Gryffindor","blood_status": "Pure-blood","occupation": "Wizards","alive": "No","wand": {"wood": "Holly","core": "Phoenix feather","length": 11 }}
+```
+As we can see, the schema defined in [`utils.py`](./artifacts/utils.py) is
+respected. Note that, unlike LM Format Enforcer, Outlines generates the fields
+in the order defined by the schema.
+
+##### Option 2. Use [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
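+
+The generate endpoint is plain HTTP, so the `curl` commands below can also be
+issued programmatically. As a rough sketch (assuming the server from this
+tutorial is listening on `localhost:8000`), the same request could be sent from
+Python with the `requests` package:
+
+```python
+import requests
+
+payload = {
+    "text_input": "Who is Harry Potter?",
+    "max_tokens": 100,
+    "bad_words": "",
+    "stop_words": "",
+    "pad_id": 2,
+    "end_id": 2,
+    # Optional: route the request through the Outlines logits post-processor.
+    "logits_post_processor_name": "outlines",
+}
+response = requests.post(
+    "http://localhost:8000/v2/models/ensemble/generate", json=payload
+)
+print(response.json()["text_output"])
+```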
+ +Let's first send a standard request, without enforcing the JSON answer format: +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' +``` + +You should expect the following response: + +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"} +``` + +Now, let's specify `logits_post_processor_name` in our request: + +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "outlines"}' +``` + +This time, the expected response looks like: +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter?{ \"name\": \"Harry Potter\",\"house\": \"Gryffindor\",\"blood_status\": \"Pure-blood\",\"occupation\": \"Wizards\",\"alive\": \"No\",\"wand\": {\"wood\": \"Holly\",\"core\": \"Phoenix feather\",\"length\": 11 }}"} +``` \ No newline at end of file diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py new file mode 100755 index 00000000..f9f2a6e8 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py @@ -0,0 +1,278 @@ +#!/usr/bin/python +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import argparse +import sys + +import client_utils +import numpy as np +import tritonclient.grpc as grpcclient +from pydantic import BaseModel + + +class AnswerFormat(BaseModel): + title: str + year: int + director: str + producer: str + plot: str + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + + parser.add_argument("-p", "--prompt", type=str, required=True, help="Input prompt.") + + parser.add_argument( + "--model-name", + type=str, + required=False, + default="ensemble", + choices=["ensemble", "tensorrt_llm_bls"], + help="Name of the Triton model to send request to", + ) + + parser.add_argument( + "-S", + "--streaming", + action="store_true", + required=False, + default=False, + help="Enable streaming mode. Default is False.", + ) + + parser.add_argument( + "-b", + "--beam-width", + required=False, + type=int, + default=1, + help="Beam width value", + ) + + parser.add_argument( + "--temperature", + type=float, + required=False, + default=1.0, + help="temperature value", + ) + + parser.add_argument( + "--repetition-penalty", + type=float, + required=False, + default=None, + help="The repetition penalty value", + ) + + parser.add_argument( + "--presence-penalty", + type=float, + required=False, + default=None, + help="The presence penalty value", + ) + + parser.add_argument( + "--frequency-penalty", + type=float, + required=False, + default=None, + help="The frequency penalty value", + ) + + parser.add_argument( + "-o", + "--output-len", + type=int, + default=100, + required=False, + help="Specify output length", + ) + + parser.add_argument( + "--request-id", + type=str, + default="", + required=False, + help="The request_id for the stop request", + ) + + parser.add_argument("--stop-words", nargs="+", default=[], help="The stop words") + + parser.add_argument("--bad-words", nargs="+", default=[], help="The bad words") + + parser.add_argument( + "--embedding-bias-words", nargs="+", default=[], help="The biased words" + ) + + parser.add_argument( + "--embedding-bias-weights", + nargs="+", + default=[], + help="The biased words weights", + ) + + parser.add_argument( + "--overwrite-output-text", + action="store_true", + required=False, + default=False, + help="In streaming mode, overwrite previously received output text instead of appending to it", + ) + + parser.add_argument( + "--return-context-logits", + action="store_true", + required=False, + default=False, + help="Return context logits, the engine must be built with gather_context_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--return-generation-logits", + action="store_true", + required=False, + default=False, + help="Return generation logits, the engine must be built with gather_ generation_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--end-id", type=int, required=False, help="The token id for end token." + ) + + parser.add_argument( + "--pad-id", type=int, required=False, help="The token id for pad token." 
+ ) + + parser.add_argument( + "--use-system-prompt", + action="store_true", + required=False, + default=False, + help="Enhance text input with system prompt.", + ) + + parser.add_argument( + "--use-schema", + action="store_true", + required=False, + default=False, + help="Use client-defined JSON schema.", + ) + + parser.add_argument( + "--logits-post-processor-name", + type=str, + required=False, + default=None, + help="Logits Post-Processor to use for output generation.", + ) + + FLAGS = parser.parse_args() + if FLAGS.url is None: + FLAGS.url = "localhost:8001" + + embedding_bias_words = ( + FLAGS.embedding_bias_words if FLAGS.embedding_bias_words else None + ) + embedding_bias_weights = ( + FLAGS.embedding_bias_weights if FLAGS.embedding_bias_weights else None + ) + + try: + client = grpcclient.InferenceServerClient(url=FLAGS.url) + except Exception as e: + print("client creation failed: " + str(e)) + sys.exit(1) + + return_context_logits_data = None + if FLAGS.return_context_logits: + return_context_logits_data = np.array( + [[FLAGS.return_context_logits]], dtype=bool + ) + + return_generation_logits_data = None + if FLAGS.return_generation_logits: + return_generation_logits_data = np.array( + [[FLAGS.return_generation_logits]], dtype=bool + ) + + prompt = FLAGS.prompt + + if FLAGS.use_system_prompt: + prompt = ( + "<|im_start|>system\n You are a helpful assistant that answers in JSON." + ) + + if FLAGS.use_schema: + prompt += "Here's the json schema you must adhere to:\n\n{schema}\n".format( + schema=AnswerFormat.model_json_schema() + ) + + prompt += "<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n".format( + user_prompt=FLAGS.prompt + ) + + output_text = client_utils.run_inference( + client, + prompt, + FLAGS.output_len, + FLAGS.request_id, + FLAGS.repetition_penalty, + FLAGS.presence_penalty, + FLAGS.frequency_penalty, + FLAGS.temperature, + FLAGS.stop_words, + FLAGS.bad_words, + embedding_bias_words, + embedding_bias_weights, + FLAGS.model_name, + FLAGS.streaming, + FLAGS.beam_width, + FLAGS.overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + FLAGS.end_id, + FLAGS.pad_id, + FLAGS.verbose, + logits_post_processor_name=FLAGS.logits_post_processor_name, + ) + + print(output_text) diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py new file mode 100755 index 00000000..1890f3c8 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py @@ -0,0 +1,225 @@ +#!/usr/bin/python + +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import queue +from functools import partial + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException, np_to_triton_dtype + + +def prepare_tensor(name, input): + t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype)) + t.set_data_from_numpy(input) + return t + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +def run_inference( + triton_client, + prompt, + output_len, + request_id, + repetition_penalty, + presence_penalty, + frequency_penalty, + temperature, + stop_words, + bad_words, + embedding_bias_words, + embedding_bias_weights, + model_name, + streaming, + beam_width, + overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + end_id, + pad_id, + verbose, + num_draft_tokens=0, + use_draft_logits=None, + logits_post_processor_name=None, +): + input0 = [[prompt]] + input0_data = np.array(input0).astype(object) + output0_len = np.ones_like(input0).astype(np.int32) * output_len + streaming_data = np.array([[streaming]], dtype=bool) + beam_width_data = np.array([[beam_width]], dtype=np.int32) + temperature_data = np.array([[temperature]], dtype=np.float32) + + inputs = [ + prepare_tensor("text_input", input0_data), + prepare_tensor("max_tokens", output0_len), + prepare_tensor("stream", streaming_data), + prepare_tensor("beam_width", beam_width_data), + prepare_tensor("temperature", temperature_data), + ] + + if num_draft_tokens > 0: + inputs.append( + prepare_tensor( + "num_draft_tokens", np.array([[num_draft_tokens]], dtype=np.int32) + ) + ) + if use_draft_logits is not None: + inputs.append( + prepare_tensor( + "use_draft_logits", np.array([[use_draft_logits]], dtype=bool) + ) + ) + + if bad_words: + bad_words_list = np.array([bad_words], dtype=object) + inputs += [prepare_tensor("bad_words", bad_words_list)] + + if stop_words: + stop_words_list = np.array([stop_words], dtype=object) + inputs += [prepare_tensor("stop_words", stop_words_list)] + + if repetition_penalty is not None: + repetition_penalty = [[repetition_penalty]] + repetition_penalty_data = np.array(repetition_penalty, dtype=np.float32) + inputs += [prepare_tensor("repetition_penalty", repetition_penalty_data)] + + if presence_penalty is not None: + presence_penalty = [[presence_penalty]] + presence_penalty_data = np.array(presence_penalty, dtype=np.float32) + inputs += [prepare_tensor("presence_penalty", presence_penalty_data)] + + if frequency_penalty is not None: + frequency_penalty = [[frequency_penalty]] + frequency_penalty_data = 
np.array(frequency_penalty, dtype=np.float32) + inputs += [prepare_tensor("frequency_penalty", frequency_penalty_data)] + + if return_context_logits_data is not None: + inputs += [ + prepare_tensor("return_context_logits", return_context_logits_data), + ] + + if return_generation_logits_data is not None: + inputs += [ + prepare_tensor("return_generation_logits", return_generation_logits_data), + ] + + if (embedding_bias_words is not None and embedding_bias_weights is None) or ( + embedding_bias_words is None and embedding_bias_weights is not None + ): + assert 0, "Both embedding bias words and weights must be specified" + + if embedding_bias_words is not None and embedding_bias_weights is not None: + assert len(embedding_bias_words) == len( + embedding_bias_weights + ), "Embedding bias weights and words must have same length" + embedding_bias_words_data = np.array([embedding_bias_words], dtype=object) + embedding_bias_weights_data = np.array( + [embedding_bias_weights], dtype=np.float32 + ) + inputs.append(prepare_tensor("embedding_bias_words", embedding_bias_words_data)) + inputs.append( + prepare_tensor("embedding_bias_weights", embedding_bias_weights_data) + ) + if end_id is not None: + end_id_data = np.array([[end_id]], dtype=np.int32) + inputs += [prepare_tensor("end_id", end_id_data)] + + if pad_id is not None: + pad_id_data = np.array([[pad_id]], dtype=np.int32) + inputs += [prepare_tensor("pad_id", pad_id_data)] + + if logits_post_processor_name is not None: + logits_post_processor_name_data = np.array( + [[logits_post_processor_name]], dtype=object + ) + inputs += [ + prepare_tensor( + "logits_post_processor_name", logits_post_processor_name_data + ) + ] + + user_data = UserData() + # Establish stream + triton_client.start_stream(callback=partial(callback, user_data)) + # Send request + triton_client.async_stream_infer(model_name, inputs, request_id=request_id) + + # Wait for server to close the stream + triton_client.stop_stream() + + # Parse the responses + output_text = "" + while True: + try: + result = user_data._completed_requests.get(block=False) + except Exception: + break + + if type(result) == InferenceServerException: + print("Received an error from server:") + print(result) + else: + output = result.as_numpy("text_output") + if streaming and beam_width == 1: + new_output = output[0].decode("utf-8") + if overwrite_output_text: + output_text = new_output + else: + output_text += new_output + else: + output_text = output[0].decode("utf-8") + if verbose: + print(output, flush=True) + + if return_context_logits_data is not None: + context_logits = result.as_numpy("context_logits") + if verbose: + print(f"context_logits.shape: {context_logits.shape}") + print(f"context_logits: {context_logits}") + if return_generation_logits_data is not None: + generation_logits = result.as_numpy("generation_logits") + if verbose: + print(f"generation_logits.shape: {generation_logits.shape}") + print(f"generation_logits: {generation_logits}") + + if streaming and beam_width == 1: + if verbose: + print(output_text) + + return output_text diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py new file mode 100644 index 00000000..70ef4237 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py @@ -0,0 +1,187 @@ +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +from collections import defaultdict +from typing import DefaultDict, Dict, List + +import torch +from lmformatenforcer import JsonSchemaParser, TokenEnforcer +from lmformatenforcer.integrations.trtllm import build_trtlmm_tokenizer_data +from outlines.fsm.guide import RegexGuide +from outlines.fsm.json_schema import build_regex_from_schema +from outlines.integrations.utils import adapt_tokenizer +from pydantic import BaseModel +from transformers import AutoTokenizer + + +class WandFormat(BaseModel): + """Represents the format of a wand description. + + Attributes: + wood (str): The type of wood used in the wand. + core (str): The core material of the wand. + length (float): The length of the wand. + """ + + wood: str + core: str + length: float + + +class AnswerFormat(BaseModel): + """Represents the output format, which LLM should follow. + + Attributes: + name (str): The name of the person. + house (str): The house affiliation of the person (e.g., Gryffindor). + blood_status (str): The blood status (e.g., pure-blood). + occupation (str): The occupation of the person. + alive (str): Whether the person is alive. + wand (WandFormat): The wand information. + """ + + name: str + house: str + blood_status: str + occupation: str + alive: str + wand: WandFormat + + +class LMFELogitsProcessor: + """ + The class implementing logits post-processor via LM Format Enforcer. + """ + + PROCESSOR_NAME = "lmfe" + + def __init__(self, tokenizer_dir, schema): + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_dir, legacy=False, padding_side="left", trust_remote_code=True + ) + self.eos_token = tokenizer.eos_token_id + tokenizer_data = build_trtlmm_tokenizer_data(tokenizer) + # TokenEnforcer provides a token filtering mechanism, + # given a tokenizer and a CharacterLevelParser. 
+ # ref: https://github.com/noamgat/lm-format-enforcer/blob/fe6cbf107218839624e3ab39b47115bf7f64dd6e/lmformatenforcer/tokenenforcer.py#L32 + self.token_enforcer = TokenEnforcer(tokenizer_data, JsonSchemaParser(schema)) + + def get_allowed_tokens(self, ids): + def _trim(ids): + return [x for x in ids if x != self.eos_token] + + allowed = self.token_enforcer.get_allowed_tokens(_trim(ids[0])) + return allowed + + def __call__( + self, + req_id: int, + logits: torch.Tensor, + ids: List[List[int]], + stream_ptr: int, + ): + # Create a mask with negative infinity to block all tokens initially. + mask = torch.full_like(logits, fill_value=float("-inf"), device=logits.device) + allowed = self.get_allowed_tokens(ids) + # Update the mask to zero for allowed tokens, + # allowing them to be selected. + mask[:, :, allowed] = 0 + with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)): + logits += mask + + +class OutlinesLogitsProcessor: + """ + The class implementing logits post-processor via Outlines. + """ + + PROCESSOR_NAME = "outlines" + + def __init__(self, tokenizer_dir, schema): + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_dir, legacy=False, padding_side="left", trust_remote_code=True + ) + tokenizer = adapt_tokenizer(tokenizer) + regex_string = build_regex_from_schema(json.dumps(schema)) + self.fsm = RegexGuide(regex_string, tokenizer) + self._fsm_state: DefaultDict[int, int] = defaultdict(int) + self.mask_cache: Dict[int, torch.Tensor] = {} + # By default, TensorRT-LLM includes request query into the output. + # Outlines should only look at generated outputs, thus we'll keep + # track of the request's input prefix. + self._prefix = [-1] + + def __call__( + self, + req_id: int, + logits: torch.Tensor, + ids: List[List[int]], + stream_ptr: int, + ): + seq_id = None + # If the prefix token IDs have changed we assume that we are dealing + # with a new sample and reset the FSM state + if ( + ids[0][: len(self._prefix)] != self._prefix + # handling edge case, when the new request is identical to already + # processed + or len(ids[0][len(self._prefix) :]) == 0 + ): + self._fsm_state = defaultdict(int) + self._prefix = ids[0] + seq_id = hash(tuple([])) + + else: + # Remove the prefix token IDs from the input token IDs, + # because the FSM should only be applied to the generated tokens + ids = ids[0][len(self._prefix) :] + last_token = ids[-1] + last_seq_id = hash(tuple(ids[:-1])) + seq_id = hash(tuple(ids)) + self._fsm_state[seq_id] = self.fsm.get_next_state( + state=self._fsm_state[last_seq_id], token_id=last_token + ) + + state_id = self._fsm_state[seq_id] + if state_id not in self.mask_cache: + allowed_tokens = self.fsm.get_next_instruction( + state=self._fsm_state[seq_id] + ).tokens + # Create a mask with negative infinity to block all + # tokens initially. + mask = torch.full_like( + logits, fill_value=float("-inf"), device=logits.device + ) + # Update the mask to zero for allowed tokens, + # allowing them to be selected. 
+ mask[:, :, allowed_tokens] = 0 + self.mask_cache[state_id] = mask + else: + mask = self.mask_cache[state_id] + + with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)): + logits += mask diff --git a/AI_Agents_Guide/Function_Calling/README.md b/AI_Agents_Guide/Function_Calling/README.md new file mode 100644 index 00000000..447e539c --- /dev/null +++ b/AI_Agents_Guide/Function_Calling/README.md @@ -0,0 +1,304 @@ + + +# Function Calling with Triton Inference Server + +This tutorial focuses on function calling, a common approach to easily connect +large language models (LLMs) to external tools. This method empowers AI agents +with effective tool usage and seamless interaction with external APIs, +significantly expanding their capabilities and practical applications. + +## Table of Contents + +- [What is Function Calling?](#what-is-function-calling) +- [Tutorial Overview](#tutorial-overview) + + [Prerequisite: Hermes-2-Pro-Llama-3-8B](#prerequisite-hermes-2-pro-llama-3-8b) +- [Function Definitions](#function-definitions) +- [Prompt Engineering](#prompt-engineering) +- [Combining Everything Together](#combining-everything-together) +- [Further Optimizations](#further-optimizations) + + [Enforcing Output Format](#enforcing-output-format) + + [Parallel Tool Call](#parallel-tool-call) +- [References](#references) + +## What is Function Calling? + +Function calling refers to the ability of LLMs to: + * Recognize when a specific function or tool needs to be used to answer a query + or perform a task. + * Generate a structured output containing the necessary arguments to call + that function. + * Integrate the results of the function call into its response. + +Function calling is a powerful mechanism that allows LLMs to perform +more complex tasks (e.g. agent orchestration in multi-agent systems) +that require specific computations or data retrieval +beyond their inherent knowledge. By recognizing when a particular function +is needed, LLMs can dynamically extend their functionality, making them more +versatile and useful in real-world applications. + +## Tutorial Overview + +This tutorial demonstrates function calling using the +[Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B) +model, which is pre-fine-tuned for this capability. We'll create a basic +stock reporting agent that provides up-to-date stock information and summarizes +recent company news. + +### Prerequisite: Hermes-2-Pro-Llama-3-8B + +Before proceeding, please make sure that you've successfully deployed +[Hermes-2-Pro-Llama-3-8B.](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B) +model with Triton Inference Server and TensorRT-LLM backend +following [these steps.](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md) + +> [!IMPORTANT] +> Make sure that the `tutorials` folder is mounted to `/tutorials`, when you +> start the docker container. + +## Function Definitions + +We'll define three functions for our stock reporting agent: +1. `get_current_stock_price`: Retrieves the current stock price for a given symbol. +2. `get_company_news`: Retrieves company news and press releases for a given stock symbol. +3. `final_answer`: Used as a no-op and to indicate the final response. 
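+
+The working implementations of these tools live in
+[client_utils.py](./artifacts/client_utils.py). As a rough illustration (not
+the tutorial's exact code), the first tool could be built on the `yfinance`
+package that the client script depends on:
+
+```python
+# Illustrative sketch only; see client_utils.py for the implementation used here.
+import yfinance as yf
+
+
+def get_current_stock_price(symbol: str):
+    """Return the latest closing price for `symbol`, or None on error."""
+    try:
+        history = yf.Ticker(symbol).history(period="1d")
+        if history.empty:
+            return None
+        return float(history["Close"].iloc[-1])
+    except Exception:
+        return None
+```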
+ +Each function includes its name, description, and input parameter schema: + ```python +TOOLS = [ + { + "type": "function", + "function": { + "name": "get_current_stock_price", + "description": "Get the current stock price for a given symbol.\n\nArgs:\n symbol (str): The stock symbol.\n\nReturns:\n float: The current stock price, or None if an error occurs.", + "parameters": { + "type": "object", + "properties": {"symbol": {"type": "string"}}, + "required": ["symbol"], + }, + }, + }, + { + "type": "function", + "function": { + "name": "get_company_news", + "description": "Get company news and press releases for a given stock symbol.\n\nArgs:\nsymbol (str): The stock symbol.\n\nReturns:\npd.DataFrame: DataFrame containing company news and press releases.", + "parameters": { + "type": "object", + "properties": {"symbol": {"type": "string"}}, + "required": ["symbol"], + }, + }, + }, + { + "type": "function", + "function": { + "name": "final_answer", + "description": "Return final generated answer", + "parameters": { + "type": "object", + "properties": {"final_response": {"type": "string"}}, + "required": ["final_response"], + }, + }, + }, +] + ``` +These function definitions will be passed to our model through a prompt, +enabling it to recognize and utilize them appropriately during the conversation. + +For the actual implementations, please refer to [client_utils.py.](./artifacts/client_utils.py) + +## Prompt Engineering + +**Prompt engineering** is a crucial aspect of function calling, as it guides +the LLM in recognizing when and how to utilize specific functions. +By carefully crafting prompts, you can effectively define the LLM's role, +objectives, and the tools it can access, ensuring accurate and efficient task +execution. + +For our task, we've organized a sample prompt structure, provided +in the accompanying [`system_prompt_schema.yml`](./artifacts/system_prompt_schema.yml) +file. This file meticulously outlines: + +- **Role**: Defines the specific role the LLM is expected to perform. +- **Objective**: Clearly states the goal or desired outcome of the interaction. +- **Tools**: Lists the available functions or tools the LLM can use to achieve +its objective. +- **Schema**: Specifies the structure and format required for calling each tool +or function. +- **Instructions**: Provides a clear set of guidelines to ensure the LLM follows +the intended path and utilizes the tools appropriately. + +By leveraging prompt engineering, you can enhance the LLM's ability +to perform complex tasks and integrate function calls seamlessly into +its responses, thereby maximizing its utility in various applications. + +## Combining Everything Together + +First, let's start Triton SDK container: +```bash +# Using the SDK container as an example +docker run --rm -it --net host --shm-size=2g \ + --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ + -v /path/to/tutorials/:/tutorials \ + -v /path/to/tutorials/repo:/tutorials \ + nvcr.io/nvidia/tritonserver:-py3-sdk +``` + +The provided client script uses `pydantic` and `yfinance` libraries, which we +do not ship with the sdk container. Make sure to install it, before proceeding: + +```bash +pip install pydantic yfinance +``` + +Run the provided [`client.py`](./artifacts/client.py) as follows: + +```bash +python3 /tutorials/AI_Agents_Guide/Function_Calling/artifacts/client.py --prompt "Tell me about Rivian. Include current stock price in your final response." 
-o 200 +``` + +You should expect to see a response similar to: + +```bash ++++++++++++++++++++++++++++++++++++++ +RESPONSE: Rivian, with its current stock price of , ++++++++++++++++++++++++++++++++++++++ +``` + +To see what tools were "called" by our LLM, simply add `verbose` flag as follows: +```bash +python3 /tutorials/AI_Agents_Guide/Function_Calling/artifacts/client.py --prompt "Tell me about Rivian. Include current stock price in your final response." -o 200 --verbose +``` + +This will show the step-by-step process of function calling, including: +- The tools being called +- The arguments passed to each tool +- The responses from each function call +- The final summarized response + + +```bash +[b'\n{\n "step": "1",\n "description": "Get the current stock price for Rivian",\n "tool": "get_current_stock_price",\n "arguments": {\n "symbol": "RIVN"\n }\n}'] +===================================== +Executing function: get_current_stock_price({'symbol': 'RIVN'}) +Function response: +===================================== +[b'\n{\n "step": "2",\n "description": "Get company news and press releases for Rivian",\n "tool": "get_company_news",\n "arguments": {\n "symbol": "RIVN"\n }\n}'] +===================================== +Executing function: get_company_news({'symbol': 'RIVN'}) +Function response: [] +===================================== +[b'\n{\n "step": "3",\n "description": "Summarize the company news and press releases for Rivian",\n "tool": "final_answer",\n "arguments": {\n "final_response": "Rivian, with its current stock price of , "\n }\n}'] + + ++++++++++++++++++++++++++++++++++++++ +RESPONSE: Rivian, with its current stock price of , ++++++++++++++++++++++++++++++++++++++ +``` + +> [!TIP] +> In this tutorial, all functionalities (tool definitions, implementations, +> and executions) are implemented on the client side (see +> [client.py](./artifacts/client.py)). +> For production scenarios, especially when functions are known beforehand, +> consider implementing this logic on the server side. +> A recommended approach for server-side implementation is to deploy your +> workflow through a Triton [ensemble](https://github.com/triton-inference-server/server/blob/a6fff975a214ff00221790dd0a5521fb05ce3ac9/docs/user_guide/architecture.md#ensemble-models) +> or a [BLS](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting). +> Use a pre-processing model to combine and format the user prompt with the +> system prompt and available tools. Employ a post-processing model to manage +> multiple calls to the deployed LLM as needed to reach the final answer. + +## Further Optimizations + +### Enforcing Output Format + +In this tutorial, we demonstrated how to enforce a specific output format +using prompt engineering. The desired structure is as follows: +```python + { + "step" : + "description": + "tool": , + "arguments": { + + } + } +``` +However, there may be instances where the output deviates from this +required schema. For example, consider the following prompt execution: + +```bash +python3 /tutorials/AI_Agents_Guide/Function_Calling/artifacts/client.py --prompt "How Rivian is doing?" -o 500 --verbose +``` +This execution may fail with an invalid JSON format error. 
The verbose
output will reveal that the final LLM response contained plain text
instead of the expected JSON format:
```
{
  "step": "3",
  "description":
  "tool": "final_answer",
  "arguments": {
      "final_response":
  }
}
```
Fortunately, this behavior can be controlled using constrained decoding,
a technique that guides the model to generate outputs that meet specific
formatting and content requirements. We strongly recommend exploring our
dedicated [tutorial](../Constrained_Decoding/README.md) on constrained decoding
to gain deeper insights and enhance your ability to manage model outputs
effectively.

> [!TIP]
> For optimal results, utilize the `FunctionCall` class defined in
> [client_utils.py](./artifacts/client_utils.py) as the JSON schema
> for your Logits Post-Processor. This approach ensures consistent
> and properly formatted outputs, aligning with the structure we've
> established throughout this tutorial.

### Parallel Tool Call

This tutorial focuses on a single-turn forced call, in which the LLM is
prompted to make a specific function call within a single interaction.
This approach is useful when a precise action is needed immediately,
ensuring that the function is executed as part of the current conversation.

It is also possible that some function calls can be executed simultaneously.
This technique is beneficial for tasks that can be divided into independent
operations, allowing for increased efficiency and reduced response time.

We encourage our readers to take on the challenge of implementing
parallel tool calls as a practical exercise.

## References

Parts of this tutorial are based on [Hermes-Function-Calling](https://github.com/NousResearch/Hermes-Function-Calling).
\ No newline at end of file
diff --git a/AI_Agents_Guide/Function_Calling/artifacts/client.py b/AI_Agents_Guide/Function_Calling/artifacts/client.py
new file mode 100755
index 00000000..518a33ad
--- /dev/null
+++ b/AI_Agents_Guide/Function_Calling/artifacts/client.py
@@ -0,0 +1,276 @@
+#!/usr/bin/python
+# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name of NVIDIA CORPORATION nor the names of its
+# contributors may be used to endorse or promote products derived
+# from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import argparse +import json +import sys + +import client_utils +import numpy as np +import tritonclient.grpc as grpcclient + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + + parser.add_argument("-p", "--prompt", type=str, required=True, help="Input prompt.") + + parser.add_argument( + "--model-name", + type=str, + required=False, + default="ensemble", + choices=["ensemble", "tensorrt_llm_bls"], + help="Name of the Triton model to send request to", + ) + + parser.add_argument( + "-S", + "--streaming", + action="store_true", + required=False, + default=False, + help="Enable streaming mode. Default is False.", + ) + + parser.add_argument( + "-b", + "--beam-width", + required=False, + type=int, + default=1, + help="Beam width value", + ) + + parser.add_argument( + "--temperature", + type=float, + required=False, + default=1.0, + help="temperature value", + ) + + parser.add_argument( + "--repetition-penalty", + type=float, + required=False, + default=None, + help="The repetition penalty value", + ) + + parser.add_argument( + "--presence-penalty", + type=float, + required=False, + default=None, + help="The presence penalty value", + ) + + parser.add_argument( + "--frequency-penalty", + type=float, + required=False, + default=None, + help="The frequency penalty value", + ) + + parser.add_argument( + "-o", + "--output-len", + type=int, + default=100, + required=False, + help="Specify output length", + ) + + parser.add_argument( + "--request-id", + type=str, + default="", + required=False, + help="The request_id for the stop request", + ) + + parser.add_argument("--stop-words", nargs="+", default=[], help="The stop words") + + parser.add_argument("--bad-words", nargs="+", default=[], help="The bad words") + + parser.add_argument( + "--embedding-bias-words", nargs="+", default=[], help="The biased words" + ) + + parser.add_argument( + "--embedding-bias-weights", + nargs="+", + default=[], + help="The biased words weights", + ) + + parser.add_argument( + "--overwrite-output-text", + action="store_true", + required=False, + default=False, + help="In streaming mode, overwrite previously received output text instead of appending to it", + ) + + parser.add_argument( + "--return-context-logits", + action="store_true", + required=False, + default=False, + help="Return context logits, the engine must be built with gather_context_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--return-generation-logits", + action="store_true", + required=False, + default=False, + help="Return generation logits, the engine must be built with gather_ generation_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--end-id", type=int, required=False, help="The token id for end token." 
+ ) + + parser.add_argument( + "--pad-id", type=int, required=False, help="The token id for pad token." + ) + + FLAGS = parser.parse_args() + if FLAGS.url is None: + FLAGS.url = "localhost:8001" + + embedding_bias_words = ( + FLAGS.embedding_bias_words if FLAGS.embedding_bias_words else None + ) + embedding_bias_weights = ( + FLAGS.embedding_bias_weights if FLAGS.embedding_bias_weights else None + ) + + try: + client = grpcclient.InferenceServerClient(url=FLAGS.url) + except Exception as e: + print("client creation failed: " + str(e)) + sys.exit(1) + + return_context_logits_data = None + if FLAGS.return_context_logits: + return_context_logits_data = np.array( + [[FLAGS.return_context_logits]], dtype=bool + ) + + return_generation_logits_data = None + if FLAGS.return_generation_logits: + return_generation_logits_data = np.array( + [[FLAGS.return_generation_logits]], dtype=bool + ) + + prompt = client_utils.process_prompt(FLAGS.prompt) + + functions = client_utils.MyFunctions() + + while True: + output_text = client_utils.run_inference( + client, + prompt, + FLAGS.output_len, + FLAGS.request_id, + FLAGS.repetition_penalty, + FLAGS.presence_penalty, + FLAGS.frequency_penalty, + FLAGS.temperature, + FLAGS.stop_words, + FLAGS.bad_words, + embedding_bias_words, + embedding_bias_weights, + FLAGS.model_name, + FLAGS.streaming, + FLAGS.beam_width, + FLAGS.overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + FLAGS.end_id, + FLAGS.pad_id, + FLAGS.verbose, + ) + + try: + response = json.loads(output_text) + except ValueError: + print("\n[ERROR] LLM responded with invalid JSON format!") + break + + # Repeat the loop until `final_answer` tool is called, which indicates + # that the full response is ready and llm does not require any + # additional information. Additionally, if the loop has taken more + # than 50 steps, the script ends. + if response["tool"] == "final_answer" or response["step"] == "50": + if response["tool"] == "final_answer": + final_response = response["arguments"]["final_response"] + print("\n\n+++++++++++++++++++++++++++++++++++++") + print(f"RESPONSE: {final_response}") + print("+++++++++++++++++++++++++++++++++++++\n\n") + elif response["step"] == "50": + print("\n\n+++++++++++++++++++++++++++++++++++++") + print(f"Reached maximum number of function calls available.") + print("+++++++++++++++++++++++++++++++++++++\n\n") + break + + # Extract tool's name and arguments from the response + function_name = response["tool"] + function_args = response["arguments"] + function_to_call = getattr(functions, function_name) + # Execute function call and store results in `function_response` + function_response = function_to_call(*function_args.values()) + + if FLAGS.verbose: + print("=====================================") + print(f"Executing function: {function_name}({function_args}) ") + print(f"Function response: {str(function_response)}") + print("=====================================") + + # Update prompt with the generated function call and results of that + # call. 
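+ # The model's tool call is closed with an `<|im_end|>` tag, the tool's
+ # result is appended as a JSON object, and a new assistant turn is opened
+ # so the model can continue planning with the new information.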
+ results_dict = f'{{"name": "{function_name}", "content": {function_response}}}' + prompt += str( + output_text + + "<|im_end|>\n" + + str(results_dict) + + "\n<|im_start|>assistant" + ) diff --git a/AI_Agents_Guide/Function_Calling/artifacts/client_utils.py b/AI_Agents_Guide/Function_Calling/artifacts/client_utils.py new file mode 100644 index 00000000..552c98fc --- /dev/null +++ b/AI_Agents_Guide/Function_Calling/artifacts/client_utils.py @@ -0,0 +1,442 @@ +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
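+"""Helper utilities for the function-calling tutorial client.
+
+This module defines the tool (function) specifications exposed to the LLM,
+their Python implementations, pydantic schemas for tool calls and the system
+prompt, prompt-formatting helpers, and Triton gRPC client helpers used by
+`client.py`.
+"""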
+import os +import sys +from functools import partial +from pathlib import Path +from typing import Dict + +import pandas as pd +import yaml +import yfinance as yf +from pydantic import BaseModel + +sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + +import queue +import sys + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException, np_to_triton_dtype + +############################################################################### +# TOOLS Definition and Implementation # +############################################################################### + +TOOLS = [ + { + "type": "function", + "function": { + "name": "get_current_stock_price", + "description": "Get the current stock price for a given symbol.\n\nArgs:\n symbol (str): The stock symbol.\n\nReturns:\n float: The current stock price, or None if an error occurs.", + "parameters": { + "type": "object", + "properties": {"symbol": {"type": "string"}}, + "required": ["symbol"], + }, + }, + }, + { + "type": "function", + "function": { + "name": "get_company_news", + "description": "Get company news and press releases for a given stock symbol.\n\nArgs:\nsymbol (str): The stock symbol.\n\nReturns:\npd.DataFrame: DataFrame containing company news and press releases.", + "parameters": { + "type": "object", + "properties": {"symbol": {"type": "string"}}, + "required": ["symbol"], + }, + }, + }, + { + "type": "function", + "function": { + "name": "final_answer", + "description": "Return final generated answer", + "parameters": { + "type": "object", + "properties": {"final_response": {"type": "string"}}, + "required": ["final_response"], + }, + }, + }, +] + + +class MyFunctions: + def get_company_news(self, symbol: str) -> pd.DataFrame: + """ + Get company news and press releases for a given stock symbol. + + Args: + symbol (str): The stock symbol. + + Returns: + pd.DataFrame: DataFrame containing company news and press releases. + """ + try: + news = yf.Ticker(symbol).news + title_list = [] + for entry in news: + title_list.append(entry["title"]) + return title_list + except Exception as e: + print(f"Error fetching company news for {symbol}: {e}") + return pd.DataFrame() + + def get_current_stock_price(self, symbol: str) -> float: + """ + Get the current stock price for a given symbol. + + Args: + symbol (str): The stock symbol. + + Returns: + float: The current stock price, or None if an error occurs. + """ + try: + stock = yf.Ticker(symbol) + # Use "regularMarketPrice" for regular market hours, or "currentPrice" for pre/post market + current_price = stock.info.get( + "regularMarketPrice", stock.info.get("currentPrice") + ) + return current_price if current_price else None + except Exception as e: + print(f"Error fetching current price for {symbol}: {e}") + return None + + +############################################################################### +# Helper Schemas # +############################################################################### + + +class FunctionCall(BaseModel): + step: str + """Step number for the action sequence""" + description: str + """Description of what the step does and its output""" + + tool: str + """The name of the tool to call.""" + + arguments: dict + """ + The arguments to call the function with, as generated by the model in JSON + format. Note that the model does not always generate valid JSON, and may + hallucinate parameters not defined by your function schema. 
Validate the + arguments in your code before calling your function. + """ + + +class PromptSchema(BaseModel): + Role: str + """Defines the specific role the LLM is expected to perform.""" + Objective: str + """States the goal or desired outcome of the interaction.""" + Tools: str + """A set of available functions or tools the LLM can use to achieve its + objective.""" + Schema: str + """ Specifies the structure and format required for calling each tool + or function.""" + Instructions: str + """Provides a clear set of guidelines to ensure the LLM follows + the intended path and utilizes the tools appropriately.""" + + +############################################################################### +# Prompt processing helper functions # +############################################################################### + + +def read_yaml_file(file_path: str) -> PromptSchema: + """ + Reads a YAML file and converts its content into a PromptSchema object. + + Args: + file_path (str): The path to the YAML file. + + Returns: + PromptSchema: An object containing the structured prompt data. + """ + with open(file_path, "r") as file: + yaml_content = yaml.safe_load(file) + + prompt_schema = PromptSchema( + Role=yaml_content.get("Role", ""), + Objective=yaml_content.get("Objective", ""), + Tools=yaml_content.get("Tools", ""), + Schema=yaml_content.get("Schema", ""), + Instructions=yaml_content.get("Instructions", ""), + ) + return prompt_schema + + +def format_yaml_prompt(prompt_schema: PromptSchema, variables: Dict) -> str: + """ + Formats the prompt schema with provided variables. + + Args: + prompt_schema (PromptSchema): The prompt schema to format. + variables (Dict): A dictionary of variables to insert into the prompt. + + Returns: + str: The formatted prompt string. + """ + formatted_prompt = "" + for field, value in prompt_schema.model_dump().items(): + formatted_value = value.format(**variables) + if field == "Instructions": + formatted_prompt += f"{formatted_value}" + else: + formatted_value = formatted_value.replace("\n", " ") + formatted_prompt += f"{formatted_value}" + return formatted_prompt + + +def process_prompt( + user_prompt, + system_prompt_yml=Path(__file__).parent.joinpath("./system_prompt_schema.yml"), + tools=TOOLS, + schema_json=FunctionCall.model_json_schema(), +): + """ + Combines and formats the user prompt with a system prompt for model + processing. + + This function reads a system prompt from a YAML file, formats it with the + provided tools and schema, and integrates it with the user's original + prompt. The result is a structured prompt ready for input into a + language model. + + Args: + user_prompt (str): The initial prompt provided by the user. + system_prompt_yml (str, optional): The file path to the system prompt + defined in a YAML file. Defaults to "./system_prompt_schema.yml". + tools (list, optional): A list of tools available for the prompt. + Defaults to the global TOOLS variable. + schema_json (dict, optional): A JSON schema for a generated function call. + Defaults to the schema from FunctionCall.model_json_schema(). + + Returns: + str: A formatted prompt string ready for use by the language model. + """ + prompt_schema = read_yaml_file(system_prompt_yml) + variables = {"tools": tools, "schema": schema_json} + sys_prompt = format_yaml_prompt(prompt_schema, variables) + processed_prompt = f"<|im_start|>system\n {sys_prompt}<|im_end|>\n" + processed_prompt += f"<|im_start|>user\n {user_prompt}\nThis is the first turn and you don't have to analyze yet. 
<|im_end|>\n <|im_start|>assistant" + return processed_prompt + + +############################################################################### +# Triton client helper functions # +############################################################################### + + +def prepare_tensor(name, input): + t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype)) + t.set_data_from_numpy(input) + return t + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +def run_inference( + triton_client, + prompt, + output_len, + request_id, + repetition_penalty, + presence_penalty, + frequency_penalty, + temperature, + stop_words, + bad_words, + embedding_bias_words, + embedding_bias_weights, + model_name, + streaming, + beam_width, + overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + end_id, + pad_id, + verbose, + num_draft_tokens=0, + use_draft_logits=None, +): + input0 = [[prompt]] + input0_data = np.array(input0).astype(object) + output0_len = np.ones_like(input0).astype(np.int32) * output_len + streaming_data = np.array([[streaming]], dtype=bool) + beam_width_data = np.array([[beam_width]], dtype=np.int32) + temperature_data = np.array([[temperature]], dtype=np.float32) + + inputs = [ + prepare_tensor("text_input", input0_data), + prepare_tensor("max_tokens", output0_len), + prepare_tensor("stream", streaming_data), + prepare_tensor("beam_width", beam_width_data), + prepare_tensor("temperature", temperature_data), + ] + + if num_draft_tokens > 0: + inputs.append( + prepare_tensor( + "num_draft_tokens", np.array([[num_draft_tokens]], dtype=np.int32) + ) + ) + if use_draft_logits is not None: + inputs.append( + prepare_tensor( + "use_draft_logits", np.array([[use_draft_logits]], dtype=bool) + ) + ) + + if bad_words: + bad_words_list = np.array([bad_words], dtype=object) + inputs += [prepare_tensor("bad_words", bad_words_list)] + + if stop_words: + stop_words_list = np.array([stop_words], dtype=object) + inputs += [prepare_tensor("stop_words", stop_words_list)] + + if repetition_penalty is not None: + repetition_penalty = [[repetition_penalty]] + repetition_penalty_data = np.array(repetition_penalty, dtype=np.float32) + inputs += [prepare_tensor("repetition_penalty", repetition_penalty_data)] + + if presence_penalty is not None: + presence_penalty = [[presence_penalty]] + presence_penalty_data = np.array(presence_penalty, dtype=np.float32) + inputs += [prepare_tensor("presence_penalty", presence_penalty_data)] + + if frequency_penalty is not None: + frequency_penalty = [[frequency_penalty]] + frequency_penalty_data = np.array(frequency_penalty, dtype=np.float32) + inputs += [prepare_tensor("frequency_penalty", frequency_penalty_data)] + + if return_context_logits_data is not None: + inputs += [ + prepare_tensor("return_context_logits", return_context_logits_data), + ] + + if return_generation_logits_data is not None: + inputs += [ + prepare_tensor("return_generation_logits", return_generation_logits_data), + ] + + if (embedding_bias_words is not None and embedding_bias_weights is None) or ( + embedding_bias_words is None and embedding_bias_weights is not None + ): + assert 0, "Both embedding bias words and weights must be specified" + + if embedding_bias_words is not None and embedding_bias_weights is not None: + assert len(embedding_bias_words) == len( + 
embedding_bias_weights + ), "Embedding bias weights and words must have same length" + embedding_bias_words_data = np.array([embedding_bias_words], dtype=object) + embedding_bias_weights_data = np.array( + [embedding_bias_weights], dtype=np.float32 + ) + inputs.append(prepare_tensor("embedding_bias_words", embedding_bias_words_data)) + inputs.append( + prepare_tensor("embedding_bias_weights", embedding_bias_weights_data) + ) + if end_id is not None: + end_id_data = np.array([[end_id]], dtype=np.int32) + inputs += [prepare_tensor("end_id", end_id_data)] + + if pad_id is not None: + pad_id_data = np.array([[pad_id]], dtype=np.int32) + inputs += [prepare_tensor("pad_id", pad_id_data)] + + user_data = UserData() + # Establish stream + triton_client.start_stream(callback=partial(callback, user_data)) + # Send request + triton_client.async_stream_infer(model_name, inputs, request_id=request_id) + + # Wait for server to close the stream + triton_client.stop_stream() + + # Parse the responses + output_text = "" + while True: + try: + result = user_data._completed_requests.get(block=False) + except Exception: + break + + if type(result) == InferenceServerException: + print("Received an error from server:") + print(result) + else: + output = result.as_numpy("text_output") + if streaming and beam_width == 1: + new_output = output[0].decode("utf-8") + if overwrite_output_text: + output_text = new_output + else: + output_text += new_output + else: + output_text = output[0].decode("utf-8") + if verbose: + print( + str("\n[VERBOSE MODE] LLM's response:" + output_text), + flush=True, + ) + + if return_context_logits_data is not None: + context_logits = result.as_numpy("context_logits") + if verbose: + print(f"context_logits.shape: {context_logits.shape}") + print(f"context_logits: {context_logits}") + if return_generation_logits_data is not None: + generation_logits = result.as_numpy("generation_logits") + if verbose: + print(f"generation_logits.shape: {generation_logits.shape}") + print(f"generation_logits: {generation_logits}") + + if streaming and beam_width == 1: + if verbose: + print(output_text) + + return output_text diff --git a/AI_Agents_Guide/Function_Calling/artifacts/system_prompt_schema.yml b/AI_Agents_Guide/Function_Calling/artifacts/system_prompt_schema.yml new file mode 100644 index 00000000..80ae7b6b --- /dev/null +++ b/AI_Agents_Guide/Function_Calling/artifacts/system_prompt_schema.yml @@ -0,0 +1,54 @@ +Role: | + You are an expert assistant who can solve any task using JSON tool calls. + You will be given a task to solve as best you can. + These tools are basically Python functions which you can call with code. + If your task is not related to any of available tools, don't use any of + available tools. +Objective: | + You may use agentic frameworks for reasoning and planning to help with user query. + Please call a function and wait for function results to be provided to you in the next iteration. + Don't make assumptions about what values to plug into function arguments. + Once you have called a function, results will be fed back to you within XML tags + in the following form: + {{"name": , "content": }} + Don't make assumptions about tool results if XML tags are not present since function hasn't been executed yet. + Analyze the data once you get the results and call another function. + Your final response should directly answer the user query with an analysis or summary of the results of function calls. + You MUST summarise all previous responses in the final response. 
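+# Note: {tools} and {schema} below are placeholders that are filled in at
+# runtime by `format_yaml_prompt` in client_utils.py.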
+Tools: | + Only use the set of these available tools: + {tools} + If none of those tools are related to the task, then only use `final_answer` + to provide your response. +Schema: | + Use the following pydantic model json schema for each tool call you will make: + {schema} +Instructions: | + Output a step-by-step plan to solve the task using the given tools. + This plan should involve individual tasks based on the available tools, + that if executed correctly will yield the correct answer. + Each step should be structured as follows: + {{ + "step" : + "description": + "tool": , + "arguments": {{ + + }} + }} + Each step must be necessary to reach the final answer. + Steps should reuse outputs produced by earlier steps. + The last step must be the final answer. It is the only way to complete + the task, else you will be stuck on a loop. + So your final output should look like this: + {{ + "step" : + "description": "Provide the final answer", + "tool": "final_answer", + "arguments": {{ + "final_response": + }} + }} + Calling multiple functions at once can overload the system and increase + cost so call one function at a time please. + If you plan to continue with analysis, always call another function. diff --git a/AI_Agents_Guide/README.md b/AI_Agents_Guide/README.md new file mode 100644 index 00000000..3224d7a8 --- /dev/null +++ b/AI_Agents_Guide/README.md @@ -0,0 +1,62 @@ + + +# Guide to Deploying AI Agents with Triton Inference Server + +Welcome to the **Guide to Deploying AI Agents with Triton Inference Server**. +This repository provides a set of tutorials designed to help you deploy +AI agents efficiently using the Triton Inference Server. This guide is intended +for users who are already familiar with the basics of Triton and are looking to +expand their knowledge. + +For beginners, we recommend starting with the +[Conceptual Guide](tutorials/Conceptual_Guide/README.md), which covers +foundational concepts and basic setup of Triton Inference Server. + +## AI agents and Agentic Workflows + +Modern large language models (LLMs) are integral components of AI agents — +sophisticated self-governing systems that make decisions by interacting with +their environment and analyzing the data they gather. By integrating LLMs, +AI agents can understand, generate, and respond to human language with high +proficiency, enabling them to perform complex tasks such as language +translation, content generation, and conversational interactions. + + +## Table of Contents + +- [Constrained Decoding](Constrained_Decoding/README.md) + * Learn about constrained decoding, how to implement it in Triton, + and explore practical examples and use cases. +- [Function Calling](Function_Calling/README.md) + * Discover how to set up and utilize function calling within AI models using + Triton. This section includes detailed instructions and examples to help you + integrate function calling into your deployments. + + + diff --git a/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md b/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md new file mode 100644 index 00000000..c5e9d77c --- /dev/null +++ b/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md @@ -0,0 +1,236 @@ + + +# Deploying Hermes-2-Pro-Llama-3-8B Model with Triton Inference Server + +The [Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B) +is an advanced language model developed by [NousResearch](https://nousresearch.com/). 
+This model is an enhancement of the Meta-Llama-3-8B finetuned in-house using the +OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and +JSON Mode dataset developed by NousResearch. These advancements enable the model +to excel in both general conversational tasks and specialized functions like +structured JSON outputs and function calling, making it a versatile tool for +various applications. + +The model is available for download through [huggingface](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B). + +TensorRT-LLM is Nvidia's recommended solution of running Large Language +Models(LLMs) on Nvidia GPUs. Read more about TensoRT-LLM [here](https://github.com/NVIDIA/TensorRT-LLM) +and Triton's TensorRT-LLM Backend [here](https://github.com/triton-inference-server/tensorrtllm_backend). + +*NOTE:* If some parts of this tutorial doesn't work, it is possible that there +are some version mismatches between the `tutorials` and `tensorrtllm_backend` +repository. Refer to [llama.md](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md) +for more detailed modifications if necessary. And if you are familiar with +python, you can also try using +[High-level API](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/high-level-api/README.md) +for LLM workflow. + +## Prerequisite: TensorRT-LLM backend + +This tutorial requires TensorRT-LLM Backend repository. Please note, +that for best user experience we recommend using the latest +[release tag](https://github.com/triton-inference-server/tensorrtllm_backend/tags) +of `tensorrtllm_backend` and +the latest [Triton Server container.](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) + +To clone TensorRT-LLM Backend repository, make sure to run the following +set of commands. +```bash +git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch +# Update the submodules +cd tensorrtllm_backend +# Install git-lfs if needed +apt-get update && apt-get install git-lfs -y --no-install-recommends +git lfs install +git submodule update --init --recursive +``` + +## Launch Triton TensorRT-LLM container + +Launch Triton docker container with TensorRT-LLM backend. +Note that we're mounting `tensorrtllm_backend` to `/tensorrtllm_backend` +and the Hermes model to `/Hermes-2-Pro-Llama-3-8B` in the docker container for +simplicity. Make an `engines` folder outside docker to reuse engines for future +runs. Please, make sure to replace with the version of Triton that you +want to use. + +```bash +docker run --rm -it --net host --shm-size=2g \ + --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ + -v :/tensorrtllm_backend \ + -v :/Hermes-2-Pro-Llama-3-8B \ + -v :/engines \ + nvcr.io/nvidia/tritonserver:-trtllm-python-py3 +``` + +Alternatively, you can follow instructions +[here](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#build-the-docker-container) +to build Triton Server with Tensorrt-LLM Backend if you want to build +a specialized container. + +Don't forget to allow gpu usage when you launch the container. + +## Create Engines for each model [skip this step if you already have an engine] + +TensorRT-LLM requires each model to be compiled for the configuration +you need before running. To do so, before you run your model for the first time +on Triton Server you will need to create a TensorRT-LLM engine. 
+
+The Triton Server TensorRT-LLM container comes with a pre-installed
+TensorRT-LLM package, which allows users to build engines inside the Triton
+container. Simply follow the steps below:
+
+```bash
+HF_LLAMA_MODEL=/Hermes-2-Pro-Llama-3-8B
+UNIFIED_CKPT_PATH=/tmp/ckpt/hermes/8b/
+ENGINE_DIR=/engines
+CONVERT_CHKPT_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py
+python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
+trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
+ --remove_input_padding enable \
+ --gpt_attention_plugin float16 \
+ --context_fmha enable \
+ --gemm_plugin float16 \
+ --output_dir ${ENGINE_DIR} \
+ --paged_kv_cache enable \
+ --max_batch_size 4
+```
+> Optional: You can test the output of the model with `run.py`
+> located in the same llama examples folder.
+>
+> ```bash
+> python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=${ENGINE_DIR} --max_output_len 28 --tokenizer_dir ${HF_LLAMA_MODEL} --input_text "What is ML?"
+> ```
+> You should expect the following response:
+> ```
+> Input [Text 0]: "<|begin_of_text|>What is ML?"
+> Output [Text 0 Beam 0]: "
+> Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."
+> ```
+
+## Serving with Triton
+
+The last step is to create a Triton-readable model. You can
+find a template of a model that uses inflight batching in
+[tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm).
+To run our model, you will need to:
+
+1. Copy over the inflight batcher models repository
+
+```bash
+cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
+```
+
+2. Modify `config.pbtxt` for the preprocessing, postprocessing and processing
+steps.
The following script do a minimized configuration to run tritonserver, +but if you want optimal performance or custom parameters, read details in +[documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md) +and [perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md): + +```bash +# preprocessing +TOKENIZER_DIR=/Hermes-2-Pro-Llama-3-8B/ +TOKENIZER_TYPE=auto +DECOUPLED_MODE=false +MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm +MAX_BATCH_SIZE=4 +INSTANCE_COUNT=1 +MAX_QUEUE_DELAY_MS=10000 +TRTLLM_BACKEND=python +FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py +python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} +python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} +python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT} +python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE} +python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRTLLM_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching +``` + +3. Launch Tritonserver + +Use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. This launches multiple instances of `tritonserver` with MPI. +```bash +python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size= --model_repo=/opt/tritonserver/inflight_batcher_llm +``` +> You should expect the following response: +> ``` +> ... +> I0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001 +> I0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000 +> I0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002 +> ``` + +To stop Triton Server inside the container, run: +```bash +pkill tritonserver +``` + +## Send an inference request + +You can test the results of the run with: +1. The [inflight_batcher_llm_client.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/inflight_batcher_llm_client.py) script. + +First, let's start Triton SDK container: +```bash +# Using the SDK container as an example +docker run --rm -it --net host --shm-size=2g \ + --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ + -v /path/to/tensorrtllm_backend/inflight_batcher_llm/client:/tensorrtllm_client \ + -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \ + nvcr.io/nvidia/tritonserver:-py3-sdk +``` + +Additionally, please install extra dependencies for the script: +```bash +pip3 install transformers sentencepiece +python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 28 --tokenizer-dir /Hermes-2-Pro-Llama-3-8B --text "What is ML?" 
+``` +> You should expect the following response: +> ``` +> ... +> Input: What is ML? +> Output beam 0: +> ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation. +> ... +> ``` + +2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint). + +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' +``` +> You should expect the following response: +> ``` +> {"context_logits":0.0,...,"text_output":"What is ML?\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."} +> ``` + + +## References + +For more examples feel free to refer to [End to end workflow to run llama.](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md) \ No newline at end of file diff --git a/README.md b/README.md index 6cad9fdd..446e4902 100644 --- a/README.md +++ b/README.md @@ -33,6 +33,7 @@ This repository contains the following resources: * [HuggingFace Guide](./HuggingFace/): The focus of this guide is to walk the user through different methods in which a HuggingFace model can be deployed using the Triton Inference Server. * [Feature Guides](./Feature_Guide/): This folder is meant to house Triton's feature-specific examples. * [Migration Guide](./Migration_Guide/migration_guide.md): Migrating from an existing solution to Triton Inference Server? Get an understanding of the general architecture that might best fit your use case. +* [Agentic Workflow Guide](./AI_Agents_Guide/): This guide provides a set of tutorials designed to help you deploy AI agents efficiently using the Triton Inference Server. ## Navigating Triton Inference Server Resources