From bf51ec900f700dbc27f7c7f1697c9faefabe0ab8 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 14 Nov 2023 18:37:25 +0100 Subject: [PATCH] - Added docs on the new completion service (reflecting the changes in `0.12.3rc1`) --- docs/docs/guides/fine-tuning.md | 45 ++++++++------- docs/docs/guides/services.md | 10 +++- docs/docs/guides/text-generation.md | 87 +++++++++++++++++++++++++++++ docs/docs/index.md | 8 +-- mkdocs.yml | 1 + 5 files changed, 121 insertions(+), 30 deletions(-) create mode 100644 docs/docs/guides/text-generation.md diff --git a/docs/docs/guides/fine-tuning.md b/docs/docs/guides/fine-tuning.md index 11f2330a0..bc331be4b 100644 --- a/docs/docs/guides/fine-tuning.md +++ b/docs/docs/guides/fine-tuning.md @@ -1,11 +1,7 @@ # Fine-tuning -For fine-tuning an LLM with dstack's API, specify a model name, HuggingFace dataset, and training parameters. - -You specify a model name, dataset on HuggingFace, and training parameters. -`dstack` takes care of the training and pushes it to the HuggingFace hub upon completion. - -You can use any cloud GPU provider(s) and experiment tracker of your choice. +For fine-tuning an LLM with `dstack`'s API, specify a model, dataset, training parameters, +and required compute resources. `dstack` takes care of everything else. ??? info "Prerequisites" To use the fine-tuning API, ensure you have the latest version: @@ -39,12 +35,14 @@ and various [training parameters](../../docs/reference/api/python/index.md#dstac ```python from dstack.api import FineTuningTask -task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", - dataset_name="peterschmidt85/samsum", - env={ - "HUGGING_FACE_HUB_TOKEN": "...", - }, - num_train_epochs=2) +task = FineTuningTask( + model_name="NousResearch/Llama-2-13b-hf", + dataset_name="peterschmidt85/samsum", + env={ + "HUGGING_FACE_HUB_TOKEN": "...", + }, + num_train_epochs=2 +) ``` !!! info "Dataset format" @@ -52,9 +50,9 @@ task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", of the corresponding model. Check the [peterschmidt85/samsum](https://huggingface.co/datasets/peterschmidt85/samsum) example. -## Submit the task +## Run the task -When submitting a task, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit). +When running a task, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit). ```python from dstack.api import Resources, GPU @@ -83,15 +81,16 @@ including getting a list of runs, stopping a given run, etc. To track experiment metrics, specify `report_to` and related authentication environment variables. ```python -task = FineTuningTask(model_name="NousResearch/Llama-2-13b-hf", - dataset_name="peterschmidt85/samsum", - report_to="wandb", - env={ - "HUGGING_FACE_HUB_TOKEN": "...", - "WANDB_API_KEY": "...", - }, - num_train_epochs=2 - ) +task = FineTuningTask( + model_name="NousResearch/Llama-2-13b-hf", + dataset_name="peterschmidt85/samsum", + report_to="wandb", + env={ + "HUGGING_FACE_HUB_TOKEN": "...", + "WANDB_API_KEY": "...", + }, + num_train_epochs=2 +) ``` Currently, the API supports `"tensorboard"` and `"wandb"`. diff --git a/docs/docs/guides/services.md b/docs/docs/guides/services.md index 8434b1796..5301418e4 100644 --- a/docs/docs/guides/services.md +++ b/docs/docs/guides/services.md @@ -5,11 +5,12 @@ Provide the commands, port, and choose the Python version or a Docker image. 
`dstack` handles the deployment on configured cloud GPU provider(s) with the necessary resources.

-## Prerequisites
+??? info "Prerequisites"

-If you're using the open-source server, you first have to set up a gateway.
+    If you're using the open-source server, you first have to set up a gateway.
+
+    ### Set up a gateway

-??? info "Set up a gateway"
    For example, if your domain is `example.com`, go ahead and run the
    `dstack gateway create` command:

@@ -93,6 +94,9 @@ Serving HTTP on https://yellow-cat-1.example.com ...

+Once the service is deployed, its endpoint will be available at
+`https://<run-name>.<domain-name>` (using the domain set up for the gateway).
+
!!! info "Run options"
    The `dstack run` command allows you to use `--gpu` to request GPUs (e.g. `--gpu A100` or `--gpu 80GB` or `--gpu A100:4`, etc.),
    and many other options (incl. spot instances, max price, max duration, retry policy, etc.).
diff --git a/docs/docs/guides/text-generation.md b/docs/docs/guides/text-generation.md
new file mode 100644
index 000000000..3d7819203
--- /dev/null
+++ b/docs/docs/guides/text-generation.md
@@ -0,0 +1,87 @@
+# Text generation
+
+For deploying an LLM with `dstack`'s API, specify a model, quantization parameters,
+and required compute resources. `dstack` takes care of everything else.
+
+??? info "Prerequisites"
+    If you're using the open-source server, before using the model serving API, make sure to
+    [set up a gateway](services.md#set-up-a-gateway).
+
+    If you're using the cloud version of `dstack`, it's set up automatically for you.
+
+    Also, to use the model serving API, ensure you have the latest version:
+
+    ```shell
+    $ pip install "dstack[all]==0.12.3rc1"
+    ```
+
+## Create a client
+
+First, you connect to `dstack`:
+
+```python
+from dstack.api import Client, ClientError
+
+try:
+    client = Client.from_config()
+except ClientError:
+    print("Can't connect to the server")
+```
+
+## Create a service
+
+Then, you create a completion service, specifying the model and the quantization parameters.
+
+```python
+from dstack.api import CompletionService
+
+service = CompletionService(
+    model_name="TheBloke/CodeLlama-34B-GPTQ",
+    quantize="gptq"
+)
+```
+
+## Run the service
+
+When running a service, you can configure resources, and many [other options](../../docs/reference/api/python/index.md#dstack.api.RunCollection.submit).
+
+```python
+from dstack.api import Resources, GPU
+
+run = client.runs.submit(
+    run_name="codellama-34b-gptq",  # (Optional) If unset, it's chosen randomly
+    configuration=service,
+    resources=Resources(gpu=GPU(memory="24GB")),
+)
+```
+
+## Access the endpoint
+
+Once the model is deployed, its endpoint will be available at
+`https://<run-name>.<domain-name>` (using the domain set up for the gateway).
+
+```shell
+$ curl https://<run-name>.<domain-name>/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens": 20}}' \
+    -H 'Content-Type: application/json'
+```
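+
+As a sketch, the same request can be sent from Python using the third-party `requests` library
+(the placeholder URL below assumes the same hypothetical run name and gateway domain as above):
+
+```python
+import requests
+
+# Placeholder endpoint; replace with your actual run name and gateway domain
+url = "https://<run-name>.<domain-name>/generate"
+
+payload = {
+    "inputs": "What is Deep Learning?",
+    "parameters": {"max_new_tokens": 20},
+}
+
+# Send the completion request and print the raw JSON response
+response = requests.post(url, json=payload)
+response.raise_for_status()
+print(response.json())
+```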
+
+> The endpoint supports streaming, continuous batching, tensor parallelism, etc.
+
+The OpenAPI documentation for the endpoint can be found at `https://<run-name>.<domain-name>/docs`.
+
+[//]: # (TODO: LangChain, own client)
+
+## Manage runs
+
+You can use the instance of [`dstack.api.Client`](../../docs/reference/api/python/index.md#dstack.api.Client) to manage your runs,
+including getting a list of runs, stopping a given run, etc.
\ No newline at end of file
diff --git a/docs/docs/index.md b/docs/docs/index.md
index 98c9675c7..e0ec8923d 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -120,7 +120,7 @@ client = Client.from_config()
     model_name="NousResearch/Llama-2-13b-hf",
     dataset_name="peterschmidt85/samsum",
     env={
-        "WANDB_API_KEY": "..."
+        "HUGGING_FACE_HUB_TOKEN": "..."
     },
     num_train_epochs=2
 )
@@ -135,7 +135,7 @@ client = Client.from_config()

 > Go to [Fine-tuning](guides/fine-tuning.md) to learn more.

-=== "Model serving"
+=== "Text generation"

     ```python
     from dstack.api import Client, GPU, CompletionService, Resources

@@ -150,13 +150,13 @@ client = Client.from_config()

     # Deploy the model as a public endpoint
     run = client.runs.submit(
-        run_name = "llama-2-13b-hf", # If not set, assigned randomly
+        run_name = "codellama-34b-gptq", # If not set, assigned randomly
         configuration=service,
         resources=Resources(gpu=GPU(memory="24GB"))
     )
     ```

-[//]: # ( > Go to [Text generation](guides/text-generation.md) to learn more.)
+ > Go to [Text generation](guides/text-generation.md) to learn more.

 ## Using the CLI
diff --git a/mkdocs.yml b/mkdocs.yml
index 96c1a650f..c588d62d5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -163,6 +163,7 @@ nav:
       - Server configuration: docs/configuration/server.md
       - Guides:
         - Fine-tuning: docs/guides/fine-tuning.md
+        - Text generation: docs/guides/text-generation.md
         - Dev environments: docs/guides/dev-environments.md
         - Tasks: docs/guides/tasks.md
         - Services: docs/guides/services.md