doc(tgi): reference the main documentation
dacorvo committed Feb 24, 2025
1 parent 72b35b9 commit ccf3b45
Showing 1 changed file with 1 addition and 180 deletions.
181 changes: 1 addition & 180 deletions docs/source/guides/neuronx_tgi.mdx
@@ -2,183 +2,4 @@

Text Generation Inference ([TGI](https://huggingface.co/docs/text-generation-inference/)) is a toolkit for deploying and serving Large Language Models (LLMs).

It is available for Inferentia2.

## Features

The basic TGI features are supported:

- continuous batching,
- token streaming,
- greedy search and multinomial sampling using [transformers](https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation).

## License

NeuronX TGI is released under an [Apache 2.0 License](https://github.com/huggingface/text-generation-inference?tab=Apache-2.0-1-ov-file#readme).

## Deploy the service from the Hugging Face hub

The simplest way to deploy the NeuronX TGI service for a specific model is to follow the
deployment instructions in the model card:

- click on the "Deploy" button on the right,
- select your deployment service ("Inference Endpoints" and "SageMaker" are supported),
- select "AWS Inferentia",
- follow the instructions.


## Deploy the service on a dedicated host

The service is launched simply by running the `neuronx-tgi` container with two sets of parameters:

```
docker run <system_parameters> ghcr.io/huggingface/neuronx-tgi:latest <service_parameters>
```

- system parameters are used to map ports, volumes and devices between the host and the service,
- service parameters are forwarded to the `text-generation-launcher`.

When deploying a service, you will need a pre-compiled Neuron model. The NeuronX TGI service supports two main modes of operation:

- you can deploy the service using a model that has already been exported to Neuron,
- or you can take advantage of the Neuron Model Cache to export your own model on the fly.

### Common system parameters

Whenever you launch a TGI service, we highly recommend mounting a shared volume as `/data` in the container: this is where
the models will be cached to speed up subsequent instantiations of the service.

Note also that enough neuron devices must be visible to the container. The simplest way to achieve that is to launch the service in `privileged` mode to get access to all neuron devices.
Alternatively, each device can be explicitly exposed using the `--device` option.

Finally, you might want to export the `HF_TOKEN` environment variable if you need to access gated repositories.

Here is an example of a service instantiation:

```
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
-e HF_TOKEN=${HF_TOKEN} \
ghcr.io/huggingface/neuronx-tgi:latest \
<service_parameters>
```

If you only want to map the first device, the launch command becomes:

```
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--device=/dev/neuron0 \
-e HF_TOKEN=${HF_TOKEN} \
ghcr.io/huggingface/neuronx-tgi:latest \
<service_parameters>
```

### Using a standard model from the 🤗 [HuggingFace Hub](https://huggingface.co/aws-neuron) (recommended)

We maintain a Neuron Model Cache of the most popular architectures and deployment parameters under [aws-neuron/optimum-neuron-cache](https://huggingface.co/aws-neuron/optimum-neuron-cache).

If you just want to try the service quickly using a model that has not been exported yet, it is still
possible to export it dynamically, provided that:
- you specify the export parameters when launching the service (or use the default parameters),
- the model configuration is cached.

The snippet below shows how you can deploy a service from a standard model on the hub:

```
export HF_TOKEN=<YOUR_TOKEN>
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
-e HF_TOKEN=${HF_TOKEN} \
-e HF_AUTO_CAST_TYPE="fp16" \
-e HF_NUM_CORES=2 \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id meta-llama/Meta-Llama-3-8B \
--max-batch-size 1 \
--max-input-length 3164 \
--max-total-tokens 4096
```

### Using a model exported to a local path

Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi) locally.
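
For instance, assuming `optimum-neuron` is installed locally (the exact flags below are assumptions and may vary between versions), the export could look like this sketch, which writes the compiled model into the shared `data` volume:

```
# Illustrative export command; the flags mirror the service parameters used
# elsewhere in this guide (batch size 1, sequence length 4096, 2 cores, fp16).
optimum-cli export neuron \
  --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type fp16 \
  $(pwd)/data/llama3-8b-neuron
```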

You can then deploy the service, pointing it to the exported model inside the shared volume:

```
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id /data/<neuron_model_path>
```

Note: You don't need to specify any service parameters, as they will all be deduced from the model export configuration.

### Using a neuron model from the 🤗 [HuggingFace Hub](https://huggingface.co/)

The easiest way to share a neuron model inside your organization is to push it to the Hugging Face hub, so that it can be deployed directly without requiring an export.
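
As a sketch (the repository name and local path below are placeholders, and `huggingface-cli` is assumed to be installed and logged in), a locally exported model can be pushed with:

```
# Upload a locally exported neuron model to a hub repository (replace the placeholders).
huggingface-cli upload <organization>/<neuron-model> ./data/<neuron_model_path>
```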

The snippet below shows how you can deploy a service from a hub neuron model:

```
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
-e HF_TOKEN=${HF_TOKEN} \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id <organization>/<neuron-model>
```

### Choosing service parameters

Use the following command to list the available service parameters:

```
docker run ghcr.io/huggingface/neuronx-tgi --help
```

The configuration of an inference endpoint is always a compromise between throughput and latency: serving more requests in parallel will allow a higher throughput, but it will increase the latency.

The neuron models have static input dimensions `[batch_size, max_length]`.

This adds several restrictions to the following parameters:

- `--max-batch-size` must be set to `batch_size`,
- `--max-input-length` must be lower than `max_length`,
- `--max-total-tokens` must be set to `max_length` (it is per-request).

Although not strictly necessary, the following is also important for efficient prefilling, as illustrated in the example below:

- `--max-batch-prefill-tokens` should be set to `batch_size` * `max-input-length`.
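
For instance (the figures are purely illustrative), for a model compiled with `batch_size=4` and `max_length=4096`, a consistent launch would use `--max-batch-size 4`, `--max-input-length 3072` (below 4096), `--max-total-tokens 4096` and `--max-batch-prefill-tokens 12288` (4 * 3072):

```
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id /data/<neuron_model_path> \
--max-batch-size 4 \
--max-input-length 3072 \
--max-total-tokens 4096 \
--max-batch-prefill-tokens 12288
```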

### Choosing the correct batch size

As explained in the previous section, the neuron model's static batch size has a direct influence on the endpoint latency and throughput.

Please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference) for optimization hints.

Note that the main constraint is to be able to fit the model for the specified `batch_size` within the total device memory available
on your instance (16GB per neuron core, with 2 cores per device).
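
As a back-of-the-envelope sketch (the model size below is a hypothetical assumption, and actual usage also includes the KV cache, activations and compilation artifacts), you can estimate the remaining memory budget like this:

```
# Rough memory budget, assuming a hypothetical 8B-parameter model in fp16 (2 bytes per parameter)
# sharded over 2 Neuron cores (16GB each). Actual usage also includes KV cache and activations.
WEIGHTS_GB=$((8 * 2))               # 8B params x 2 bytes ~ 16GB
DEVICE_MEMORY_GB=$((2 * 16))        # 2 cores x 16GB
echo "Headroom for KV cache and activations: $((DEVICE_MEMORY_GB - WEIGHTS_GB)) GB"
```

If the headroom gets too small, reducing the batch size (or `max_length`) is usually the first lever to pull.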

## Query the service

You can query the model using either the `/generate` or `/generate_stream` routes:

```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```

```
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```

Note: replace `127.0.0.1:8080` with your actual IP address and port.
A [neuron backend](https://huggingface.co/docs/text-generation-inference/en/backends/neuron) allows you to deploy TGI on Trainium and Inferentia chips.
