diff --git a/docs/source/guides/neuronx_tgi.mdx b/docs/source/guides/neuronx_tgi.mdx
index efc63eb5d..9993b4ff2 100644
--- a/docs/source/guides/neuronx_tgi.mdx
+++ b/docs/source/guides/neuronx_tgi.mdx
@@ -2,183 +2,4 @@
 
 Text Generation Inference ([TGI](https://huggingface.co/docs/text-generation-inference/)) is a toolkit for deploying and serving Large Language Models (LLMs).
 
-It is available for Inferentia2.
-
-## Features
-
-The basic TGI features are supported:
-
-- continuous batching,
-- token streaming,
-- greedy search and multinomial sampling using [transformers](https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation).
-
-## License
-
-NeuronX TGI is released under an [Apache2 License](https://github.com/huggingface/text-generation-inference?tab=Apache-2.0-1-ov-file#readme).
-
-## Deploy the service from the Hugging Face hub
-
-The simplest way to deploy the NeuronX TGI service for a specific model is to follow the
-deployment instructions in the model card:
-
-- click on the "Deploy" button on the right,
-- select your deployment service ("Inference Endpoints" and "SageMaker" are supported),
-- select "AWS Inferentia",
-- follow the instructions.
-
-
-## Deploy the service on a dedicated host
-
-The service is launched simply by running the neuronx-tgi container with two sets of parameters:
-
-```
-docker run <system_parameters> ghcr.io/huggingface/neuronx-tgi:latest <service_parameters>
-```
-
-- system parameters are used to map ports, volumes and devices between the host and the service,
-- service parameters are forwarded to the `text-generation-launcher`.
-
-When deploying a service, you will need a pre-compiled Neuron model. The NeuronX TGI service supports two main modes of operation:
-
-- you can either deploy the service on a model that has already been exported to Neuron,
-- or alternatively you can take advantage of the Neuron Model Cache to export your own model.
-
-### Common system parameters
-
-Whenever you launch a TGI service, we highly recommend you to mount a shared volume mounted as `/data` in the container: this is where
-the models will be cached to speed up further instantiations of the service.
-
-Note also that enough neuron devices should be visible by the container. The simplest way to achieve that is to launch the service in `privileged` mode to get access to all neuron devices.
-Alternatively, each device can be explicitly exposed using the `--device` option.
-
-Finally, you might want to export the `HF_TOKEN` if you want to access gated repositories.
-
-Here is an example of a service instantiation:
-
-```
-docker run -p 8080:80 \
-       -v $(pwd)/data:/data \
-       --privileged \
-       -e HF_TOKEN=${HF_TOKEN} \
-       ghcr.io/huggingface/neuronx-tgi:latest \
-       <service_parameters>
-```
-
-If you only want to map the first device, the launch command becomes:
-
-```
-docker run -p 8080:80 \
-       -v $(pwd)/data:/data \
-       --device=/dev/neuron0 \
-       -e HF_TOKEN=${HF_TOKEN} \
-       ghcr.io/huggingface/neuronx-tgi:latest \
-       <service_parameters>
-```
-
-### Using a standard model from the 🤗 [HuggingFace Hub](https://huggingface.co/aws-neuron) (recommended)
-
-We maintain a Neuron Model Cache of the most popular architecture and deployment parameters under [aws-neuron/optimum-neuron-cache](https://huggingface.co/aws-neuron/optimum-neuron-cache).
-
-If you just want to try the service quickly using a model that has not been exported yet, it is thus still
-possible to export it dynamically, pending some conditions:
-- you must specify the export parameters when launching the service (or use default parameters),
-- the model configuration must be cached.
-
-The snippet below shows how you can deploy a service from a hub standard model:
-
-```
-export HF_TOKEN=<your_token>
-docker run -p 8080:80 \
-       -v $(pwd)/data:/data \
-       --privileged \
-       -e HF_TOKEN=${HF_TOKEN} \
-       -e HF_AUTO_CAST_TYPE="fp16" \
-       -e HF_NUM_CORES=2 \
-       ghcr.io/huggingface/neuronx-tgi:latest \
-       --model-id meta-llama/Meta-Llama-3-8B \
-       --max-batch-size 1 \
-       --max-input-length 3164 \
-       --max-total-tokens 4096
-```
-
-### Using a model exported to a local path
-
-Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi) locally.
-
-You can then deploy the service inside the shared volume:
-
-```
-docker run -p 8080:80 \
-       -v $(pwd)/data:/data \
-       --privileged \
-       ghcr.io/huggingface/neuronx-tgi:latest \
-       --model-id /data/<neuron_model>
-```
-
-Note: You don't need to specify any service parameters, as they will all be deduced from the model export configuration.
-
-### Using a neuron model from the 🤗 [HuggingFace Hub](https://huggingface.co/)
-
-The easiest way to share a neuron model inside your organization is to push it on the Hugging Face hub, so that it can be deployed directly without requiring an export.
-
-The snippet below shows how you can deploy a service from a hub neuron model:
-
-```
-docker run -p 8080:80 \
-       -v $(pwd)/data:/data \
-       --privileged \
-       -e HF_TOKEN=${HF_TOKEN} \
-       ghcr.io/huggingface/neuronx-tgi:latest \
-       --model-id <organization>/<neuron_model>
-```
-
-### Choosing service parameters
-
-Use the following command to list the available service parameters:
-
-```
-docker run ghcr.io/huggingface/neuronx-tgi --help
-```
-
-The configuration of an inference endpoint is always a compromise between throughput and latency: serving more requests in parallel will allow a higher throughput, but it will increase the latency.
-
-The neuron models have static input dimensions `[batch_size, max_length]`.
-
-This adds several restrictions to the following parameters:
-
-- `--max-batch-size` must be set to `batch size`,
-- `--max-input-length` must be lower than `max_length`,
-- `--max-total-tokens` must be set to `max_length` (it is per-request).
-
-Although not strictly necessary, but important for efficient prefilling:
-
-- `--max-batch-prefill-tokens` should be set to `batch_size` * `max-input-length`.
-
-### Choosing the correct batch size
-
-As seen in the previous paragraph, neuron model static batch size has a direct influence on the endpoint latency and throughput.
-
-Please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference) for optimization hints.
-
-Note that the main constraint is to be able to fit the model for the specified `batch_size` within the total device memory available
-on your instance (16GB per neuron core, with 2 cores per device).
-
-## Query the service
-
-You can query the model using either the `/generate` or `/generate_stream` routes:
-
-```
-curl 127.0.0.1:8080/generate \
-    -X POST \
-    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-    -H 'Content-Type: application/json'
-```
-
-```
-curl 127.0.0.1:8080/generate_stream \
-    -X POST \
-    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-    -H 'Content-Type: application/json'
-```
-
-Note: replace 127.0.0.1:8080 with your actual IP address and port.
+A [neuron backend](https://huggingface.co/docs/text-generation-inference/en/backends/neuron) allows you to deploy TGI on Trainium and Inferentia chips.
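For reference, deploying through that backend follows the same Docker pattern as the commands above. The snippet below is a minimal sketch: the image tag format (`ghcr.io/huggingface/text-generation-inference:<version>-neuron`) and the model id are placeholders, and the exact tag and launcher options should be taken from the linked neuron backend page.

```
# Minimal sketch: image tag and model id are illustrative placeholders (see the neuron backend docs)
docker run -p 8080:80 \
       -v $(pwd)/data:/data \
       --privileged \
       -e HF_TOKEN=${HF_TOKEN} \
       ghcr.io/huggingface/text-generation-inference:<version>-neuron \
       --model-id <organization>/<neuron_model>
```

The port mapping, `/data` volume, device access and `HF_TOKEN` options are the same plain Docker parameters used in the examples above.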