From bf51ec900f700dbc27f7c7f1697c9faefabe0ab8 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 14 Nov 2023 18:37:25 +0100 Subject: [PATCH] - Added docs on the new completion service (reflecting the changes in `0.12.3rc1`) --- docs/docs/guides/fine-tuning.md | 45 ++++++++------- docs/docs/guides/services.md | 10 +++- docs/docs/guides/text-generation.md | 87 +++++++++++++++++++++++++++++ docs/docs/index.md | 8 +-- mkdocs.yml | 1 + 5 files changed, 121 insertions(+), 30 deletions(-) create mode 100644 docs/docs/guides/text-generation.md diff --git a/docs/docs/guides/fine-tuning.md b/docs/docs/guides/fine-tuning.md index 11f2330a0..bc331be4b 100644 --- a/docs/docs/guides/fine-tuning.md +++ b/docs/docs/guides/fine-tuning.md @@ -1,11 +1,7 @@ # Fine-tuning -For fine-tuning an LLM with dstack's API, specify a model name, HuggingFace dataset, and training parameters. - -You specify a model name, dataset on HuggingFace, and training parameters. -`dstack` takes care of the training and pushes it to the HuggingFace hub upon completion. - -You can use any cloud GPU provider(s) and experiment tracker of your choice. +For fine-tuning an LLM with `dstack`'s API, specify a model, dataset, training parameters, +and required compute resources. `dstack` takes care of everything else. ??? info "Prerequisites" To use the fine-tuning API, ensure you have the latest version: @@ -39,12 +35,14 @@ and various [training parameters](../../docs/reference/api/python/index.md#dstac ```python from dstack.api import FineTuningTask -task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", - dataset_name="peterschmidt85/samsum", - env={ - "HUGGING_FACE_HUB_TOKEN": "...", - }, - num_train_epochs=2) +task = FineTuningTask( + model_name="NousResearch/Llama-2-13b-hf", + dataset_name="peterschmidt85/samsum", + env={ + "HUGGING_FACE_HUB_TOKEN": "...", + }, + num_train_epochs=2 +) ``` !!! info "Dataset format" @@ -52,9 +50,9 @@ task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", of the corresponding model. Check the [peterschmidt85/samsum](https://huggingface.co/datasets/peterschmidt85/samsum) example. -## Submit the task +## Run the task -When submitting a task, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit). +When running a task, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit). ```python from dstack.api import Resources, GPU @@ -83,15 +81,16 @@ including getting a list of runs, stopping a given run, etc. To track experiment metrics, specify `report_to` and related authentication environment variables. ```python -task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", - dataset_name="peterschmidt85/samsum", - report_to="wandb", - env={ - "HUGGING_FACE_HUB_TOKEN": "...", - "WANDB_API_KEY": "...", - }, - num_train_epochs=2 - ) +task = FineTuningTask( + model_name="NousResearch/Llama-2-13b-hf", + dataset_name="peterschmidt85/samsum", + report_to="wandb", + env={ + "HUGGING_FACE_HUB_TOKEN": "...", + "WANDB_API_KEY": "...", + }, + num_train_epochs=2 +) ``` Currently, the API supports `"tensorboard"` and `"wandb"`. diff --git a/docs/docs/guides/services.md b/docs/docs/guides/services.md index 8434b1796..5301418e4 100644 --- a/docs/docs/guides/services.md +++ b/docs/docs/guides/services.md @@ -5,11 +5,12 @@ Provide the commands, port, and choose the Python version or a Docker image. 
`dstack` handles the deployment on configured cloud GPU provider(s) with the necessary resources.

-## Prerequisites
+??? info "Prerequisites"

-If you're using the open-source server, you first have to set up a gateway.
+    If you're using the open-source server, you first have to set up a gateway.
+
+    ### Set up a gateway

-??? info "Set up a gateway"
    For example, if your domain is `example.com`, go ahead and run the
    `dstack gateway create` command:

@@ -93,6 +94,9 @@ Serving HTTP on https://yellow-cat-1.example.com ...

+Once the service is deployed, its endpoint will be available at
+`https://<run-name>.<domain-name>` (using the domain set up for the gateway).
+
!!! info "Run options"
    The `dstack run` command allows you to use `--gpu` to request GPUs (e.g. `--gpu A100` or `--gpu 80GB` or `--gpu A100:4`, etc.),
    and many other options (incl. spot instances, max price, max duration, retry policy, etc.).
diff --git a/docs/docs/guides/text-generation.md b/docs/docs/guides/text-generation.md
new file mode 100644
index 000000000..3d7819203
--- /dev/null
+++ b/docs/docs/guides/text-generation.md
@@ -0,0 +1,87 @@
+# Text generation
+
+For deploying an LLM with `dstack`'s API, specify a model, quantization parameters,
+and required compute resources. `dstack` takes care of everything else.
+
+??? info "Prerequisites"
+    If you're using the open-source server, before using the model serving API, make sure to
+    [set up a gateway](services.md#set-up-a-gateway).
+
+    If you're using the cloud version of `dstack`, it's set up automatically for you.
+
+    Also, to use the model serving API, ensure you have the latest version:
+
+    ```shell
+    $ pip install "dstack[all]==0.12.3rc1"
+    ```
+
+## Create a client
+
+First, you connect to `dstack`:
+
+```python
+from dstack.api import Client, ClientError
+
+try:
+    client = Client.from_config()
+except ClientError:
+    print("Can't connect to the server")
+```
+
+## Create a service
+
+Then, you create a completion service, specifying the model and the quantization parameters.
+
+```python
+from dstack.api import CompletionService
+
+service = CompletionService(
+    model_name="TheBloke/CodeLlama-34B-GPTQ",
+    quantize="gptq"
+)
+```
+
+## Run the service
+
+When running a service, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit).
+
+```python
+from dstack.api import Resources, GPU
+
+run = client.runs.submit(
+    run_name="codellama-34b-gptq",  # (Optional) If unset, it's chosen randomly
+    configuration=service,
+    resources=Resources(gpu=GPU(memory="24GB")),
+)
+```
+
+## Access the endpoint
+
+Once the model is deployed, its endpoint will be available at
+`https://<run-name>.<domain-name>` (using the domain set up for the gateway).
+
+```shell
+$ curl https://<run-name>.<domain-name>/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens": 20}}' \
+    -H 'Content-Type: application/json'
+```
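+
+As a sketch, the same request can be sent from Python using the third-party `requests` library
+(the placeholder URL below assumes the same hypothetical run name and gateway domain as above):
+
+```python
+import requests
+
+# Placeholder endpoint; replace with your actual run name and gateway domain
+url = "https://<run-name>.<domain-name>/generate"
+
+payload = {
+    "inputs": "What is Deep Learning?",
+    "parameters": {"max_new_tokens": 20},
+}
+
+# Send the completion request and print the raw JSON response
+response = requests.post(url, json=payload)
+response.raise_for_status()
+print(response.json())
+```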
+
+> The endpoint supports streaming, continuous batching, tensor parallelism, etc.
+
+The OpenAPI documentation for the endpoint can be found at `https://<run-name>.<domain-name>/docs`.
+
+[//]: # (TODO: LangChain, own client)
+
+## Manage runs
+
+You can use the instance of [`dstack.api.Client`](../../docs/reference/api/python/index.md#dstack.api.Client) to manage your runs,
+including getting a list of runs, stopping a given run, etc.
\ No newline at end of file
diff --git a/docs/docs/index.md b/docs/docs/index.md
index 98c9675c7..e0ec8923d 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -120,7 +120,7 @@ client = Client.from_config()
     model_name="NousResearch/Llama-2-13b-hf",
     dataset_name="peterschmidt85/samsum",
     env={
-        "WANDB_API_KEY": "..."
+        "HUGGING_FACE_HUB_TOKEN": "..."
     },
     num_train_epochs=2
 )
@@ -135,7 +135,7 @@ client = Client.from_config()

 > Go to [Fine-tuning](guides/fine-tuning.md) to learn more.

-=== "Model serving"
+=== "Text generation"

     ```python
     from dstack.api import Client, GPU, CompletionService, Resources

@@ -150,13 +150,13 @@ client = Client.from_config()

     # Deploy the model as a public endpoint
     run = client.runs.submit(
-        run_name = "llama-2-13b-hf", # If not set, assigned randomly
+        run_name = "codellama-34b-gptq", # If not set, assigned randomly
         configuration=service,
         resources=Resources(gpu=GPU(memory="24GB"))
     )
     ```

-[//]: # ( > Go to [Text generation](guides/text-generation.md) to learn more.)
+ > Go to [Text generation](guides/text-generation.md) to learn more.

 ## Using the CLI
diff --git a/mkdocs.yml b/mkdocs.yml
index 96c1a650f..c588d62d5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -163,6 +163,7 @@ nav:
       - Server configuration: docs/configuration/server.md
       - Guides:
         - Fine-tuning: docs/guides/fine-tuning.md
+        - Text generation: docs/guides/text-generation.md
         - Dev environments: docs/guides/dev-environments.md
         - Tasks: docs/guides/tasks.md
         - Services: docs/guides/services.md