docs: streaming documentation (#5980)
Co-authored-by: Scott Martens <[email protected]>
alaeddine-13 and scott-martens authored Jul 27, 2023
1 parent 4af0308 commit e51ddca
Showing 4 changed files with 156 additions and 6 deletions.
115 changes: 114 additions & 1 deletion README.md
@@ -126,7 +126,6 @@ class StableLM(Executor):
        for prompt, output in zip(prompts, llm_outputs):
            generations.append(Generation(prompt=prompt, text=output))
        return generations

```

</td>
@@ -342,6 +341,120 @@ response[0].display()

<!-- end build-pipelines -->

### Streaming for LLMs
<!-- start llm-streaming-intro -->
Large Language Models can power a wide range of applications, from chatbots to assistants to intelligent systems.
However, these models can be heavy and slow, and your users want systems that are both intelligent _and_ fast!

Large language models work by turning your questions into tokens and then generating new tokens one at a
time until they decide that generation should stop.
This means you want to **stream** the output tokens generated by a large language model to the client.
In this tutorial, we discuss how to achieve this with Streaming Endpoints in Jina.
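
As a minimal, purely illustrative sketch of this generation loop (not Jina-specific; `model.predict_next` is a hypothetical single-step call), each token can be sent to the client as soon as it is produced:
```python
def generate_stream(prompt_tokens, max_tokens, model, eos_token_id):
    """Illustrative only: autoregressive generation naturally yields one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = model.predict_next(tokens)  # hypothetical single-step prediction
        if next_token == eos_token_id:
            break
        tokens.append(next_token)
        yield next_token  # stream it immediately instead of waiting for the full text
```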
<!-- end llm-streaming-intro -->

#### Service Schemas
<!-- start llm-streaming-schemas -->
The first step is to define the streaming service schemas, as you would do in any other service framework.
The input to the service is the prompt and the maximum number of tokens to generate, while the output is simply the
token ID:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class TokenDocument(BaseDoc):
    token_id: int
```
<!-- end llm-streaming-schemas -->

#### Service initialization
<!-- start llm-streaming-init -->
Our service depends on a large language model. As an example, we use the `gpt2` model. This is how you would load
such a model in your Executor:
```python
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
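Because the model is loaded in `__init__`, the weights are loaded once when the Executor starts, not on every request.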
<!-- end llm-streaming-init -->


#### Implement the streaming endpoint
<!-- start llm-streaming-endpoint -->
Our streaming endpoint accepts a `PromptDocument` as input and streams `TokenDocument`s back. To stream a document to
the client, use the `yield` keyword in the endpoint implementation. We therefore use the model to generate
up to `max_tokens` tokens, yielding each one until generation stops:
```python
@requests(on='/stream')
async def task(self, doc: PromptDocument, **kwargs) -> TokenDocument:
    encoded_input = tokenizer(doc.prompt, return_tensors='pt')
    for _ in range(doc.max_tokens):
        output = self.model.generate(**encoded_input, max_new_tokens=1)
        if output[0][-1] == tokenizer.eos_token_id:  # stop at end-of-sequence
            break
        yield TokenDocument(token_id=output[0][-1])
        # feed the extended sequence back in to generate the next token
        encoded_input = {
            'input_ids': output,
            'attention_mask': torch.ones(1, len(output[0])),
        }
```

Learn more about {ref}`streaming endpoints <streaming-endpoints>` in the `Executor` documentation.
<!-- end llm-streaming-endpoint -->


#### Serve and send requests
<!-- start llm-streaming-serve -->

The final step is to serve the Executor and send requests using the client.
To serve the Executor using gRPC:
```python
from jina import Deployment

with Deployment(
    uses=TokenStreamingExecutor, port=12345, protocol='grpc', include_gateway=False
) as dep:
    dep.block()
```
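Here, `include_gateway=False` means the Executor is served directly over gRPC, without a separate gateway in front of it.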

To send requests from a client:
```python
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    tokens = []
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=TokenDocument,
    ):
        tokens.append(doc.token_id)
        print(tokenizer.decode(tokens, skip_special_tokens=True))


asyncio.run(main())
```

```text
The capital of France is Paris.
```
<!-- end llm-streaming-serve -->

### Easy scalability and concurrency

Why not just use standard Python to build that microservice and pipeline? Jina accelerates time to market of your application by making it more scalable and cloud-native. Jina also handles the infrastructure complexity in production and other Day-2 operations so that you can focus on the data application itself.
8 changes: 3 additions & 5 deletions docs/concepts/serving/executor/add-endpoints.md
@@ -274,10 +274,7 @@ class MyDocument(BaseDoc):
class MyExecutor(Executor):

    @requests(on='/hello')
    async def task(self, doc: MyDocument, **kwargs):
        print()
        # for doc in docs:
        #     doc.text = 'hello world'
    async def task(self, doc: MyDocument, **kwargs) -> MyDocument:
        for i in range(100):
            yield MyDocument(text=f'hello world {i}')

@@ -296,9 +293,10 @@ Jina offers a standard python client for using the streaming endpoint:

```python
from jina import Client
from docarray import DocList
client = Client(port=12345, protocol='http', cors=True, asyncio=True) # or protocol='grpc'
async for doc in client.stream_doc(
    on='/hello', inputs=MyDocument(text='hello world'), return_type=DocList[MyDocument]
    on='/hello', inputs=MyDocument(text='hello world'), return_type=MyDocument
):
    print(doc.text)
```
1 change: 1 addition & 0 deletions docs/index.md
@@ -165,6 +165,7 @@ docarray-support
tutorials/deploy-model
tutorials/gpu-executor
tutorials/deploy-pipeline
tutorials/llm-serve
```

```{toctree}
38 changes: 38 additions & 0 deletions docs/tutorials/llm-serve.md
@@ -0,0 +1,38 @@
# Build a Streaming API for a Large Language Model
```{include} ../../README.md
:start-after: <!-- start llm-streaming-intro -->
:end-before: <!-- end llm-streaming-intro -->
```

## Service Schemas
```{include} ../../README.md
:start-after: <!-- start llm-streaming-schemas -->
:end-before: <!-- end llm-streaming-schemas -->
```

```{admonition} Note
:class: note
Thanks to DocArray's flexibility, you can define rich service schemas. For instance, you can use Tensor types to
stream token logits back to the client efficiently and implement complex token sampling strategies on the client
side, as sketched below.
```
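
For illustration, such a schema could look like the following minimal sketch (an assumption for this note, not part of the tutorial code; it uses DocArray's `NdArray` tensor type and a hypothetical `TokenLogitsDocument` name):

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class TokenLogitsDocument(BaseDoc):  # hypothetical schema, not used in the tutorial
    token_id: int
    logits: NdArray  # next-token distribution over the vocabulary, e.g. shape (vocab_size,)
```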

## Service initialization
```{include} ../../README.md
:start-after: <!-- start llm-streaming-init -->
:end-before: <!-- end llm-streaming-init -->
```

## Implement the streaming endpoint

```{include} ../../README.md
:start-after: <!-- start llm-streaming-endpoint -->
:end-before: <!-- end llm-streaming-endpoint -->
```

## Serve and send requests
```{include} ../../README.md
:start-after: <!-- start llm-streaming-serve -->
:end-before: <!-- end llm-streaming-serve -->
```
