docs: streaming documentation (#5980)
Co-authored-by: Scott Martens <[email protected]>
alaeddine-13 and scott-martens authored Jul 27, 2023
1 parent 4af0308 commit e51ddca
Showing 4 changed files with 156 additions and 6 deletions.
115 changes: 114 additions & 1 deletion README.md
@@ -126,7 +126,6 @@ class StableLM(Executor):
        for prompt, output in zip(prompts, llm_outputs):
            generations.append(Generation(prompt=prompt, text=output))
        return generations

```

</td>
@@ -342,6 +341,120 @@ response[0].display()

<!-- end build-pipelines -->

### Streaming for LLMs
<!-- start llm-streaming-intro -->
Large Language Models can power a wide range of applications, from chatbots to assistants to intelligent systems.
However, these models can be heavy and slow, and your users want systems that are both intelligent _and_ fast!

Large language models work by turning your questions into tokens and then generating new tokens one at a
time until they decide that generation should stop.
This means you want to **stream** the output tokens generated by a large language model to the client.
In this tutorial, we discuss how to achieve this with Streaming Endpoints in Jina.
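
As a minimal, purely illustrative sketch of this generation loop (not Jina-specific; `model.predict_next` is a hypothetical single-step call), each token can be sent to the client as soon as it is produced:
```python
def generate_stream(prompt_tokens, max_tokens, model, eos_token_id):
    """Illustrative only: autoregressive generation naturally yields one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = model.predict_next(tokens)  # hypothetical single-step prediction
        if next_token == eos_token_id:
            break
        tokens.append(next_token)
        yield next_token  # stream it immediately instead of waiting for the full text
```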
<!-- end llm-streaming-intro -->

#### Service Schemas
<!-- start llm-streaming-schemas -->
The first step is to define the streaming service schemas, as you would do in any other service framework.
The input to the service is the prompt and the maximum number of tokens to generate, while the output is simply the
token ID:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class TokenDocument(BaseDoc):
    token_id: int
```
<!-- end llm-streaming-schemas -->

#### Service initialization
<!-- start llm-streaming-init -->
Our service depends on a large language model. As an example, we use the `gpt2` model. This is how you would load
such a model in your Executor:
```python
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
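Because the model is loaded in `__init__`, the weights are loaded once when the Executor starts, not on every request.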
<!-- end llm-streaming-init -->


#### Implement the streaming endpoint
<!-- start llm-streaming-endpoint -->
Our streaming endpoint accepts a `PromptDocument` as input and streams `TokenDocument`s back. To stream a document to
the client, use the `yield` keyword in the endpoint implementation. We therefore use the model to generate
up to `max_tokens` tokens, yielding each one until generation stops:
```python
@requests(on='/stream')
async def task(self, doc: PromptDocument, **kwargs) -> TokenDocument:
    encoded_input = tokenizer(doc.prompt, return_tensors='pt')
    for _ in range(doc.max_tokens):
        output = self.model.generate(**encoded_input, max_new_tokens=1)
        if output[0][-1] == tokenizer.eos_token_id:  # stop at end-of-sequence
            break
        yield TokenDocument(token_id=output[0][-1])
        # feed the extended sequence back in to generate the next token
        encoded_input = {
            'input_ids': output,
            'attention_mask': torch.ones(1, len(output[0])),
        }
```

Learn more about {ref}`streaming endpoints <streaming-endpoints>` in the `Executor` documentation.
<!-- end llm-streaming-endpoint -->


#### Serve and send requests
<!-- start llm-streaming-serve -->

The final step is to serve the Executor and send requests using the client.
To serve the Executor using gRPC:
```python
from jina import Deployment

with Deployment(
    uses=TokenStreamingExecutor, port=12345, protocol='grpc', include_gateway=False
) as dep:
    dep.block()
```
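Here, `include_gateway=False` means the Executor is served directly over gRPC, without a separate gateway in front of it.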

To send requests from a client:
```python
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    tokens = []
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=TokenDocument,
    ):
        tokens.append(doc.token_id)
        print(tokenizer.decode(tokens, skip_special_tokens=True))


asyncio.run(main())
```

```text
The capital of France is Paris.
```
<!-- end llm-streaming-serve -->

### Easy scalability and concurrency

Why not just use standard Python to build that microservice and pipeline? Jina accelerates time to market of your application by making it more scalable and cloud-native. Jina also handles the infrastructure complexity in production and other Day-2 operations so that you can focus on the data application itself.
8 changes: 3 additions & 5 deletions docs/concepts/serving/executor/add-endpoints.md
@@ -274,10 +274,7 @@ class MyDocument(BaseDoc):
class MyExecutor(Executor):

    @requests(on='/hello')
    async def task(self, doc: MyDocument, **kwargs):
        print()
        # for doc in docs:
        #     doc.text = 'hello world'
    async def task(self, doc: MyDocument, **kwargs) -> MyDocument:
        for i in range(100):
            yield MyDocument(text=f'hello world {i}')

@@ -296,9 +293,10 @@ Jina offers a standard python client for using the streaming endpoint:

```python
from jina import Client
from docarray import DocList
client = Client(port=12345, protocol='http', cors=True, asyncio=True) # or protocol='grpc'
async for doc in client.stream_doc(
    on='/hello', inputs=MyDocument(text='hello world'), return_type=DocList[MyDocument]
    on='/hello', inputs=MyDocument(text='hello world'), return_type=MyDocument
):
    print(doc.text)
```
1 change: 1 addition & 0 deletions docs/index.md
@@ -165,6 +165,7 @@ docarray-support
tutorials/deploy-model
tutorials/gpu-executor
tutorials/deploy-pipeline
tutorials/llm-serve
```

```{toctree}
38 changes: 38 additions & 0 deletions docs/tutorials/llm-serve.md
@@ -0,0 +1,38 @@
# Build a Streaming API for a Large Language Model
```{include} ../../README.md
:start-after: <!-- start llm-streaming-intro -->
:end-before: <!-- end llm-streaming-intro -->
```

## Service Schemas
```{include} ../../README.md
:start-after: <!-- start llm-streaming-schemas -->
:end-before: <!-- end llm-streaming-schemas -->
```

```{admonition} Note
:class: note
Thanks to DocArray's flexibility, you can define rich service schemas. For instance, you can use Tensor types to
stream token logits back to the client efficiently and implement complex token sampling strategies on the client
side, as sketched below.
```
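
For illustration, such a schema could look like the following minimal sketch (an assumption for this note, not part of the tutorial code; it uses DocArray's `NdArray` tensor type and a hypothetical `TokenLogitsDocument` name):

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class TokenLogitsDocument(BaseDoc):  # hypothetical schema, not used in the tutorial
    token_id: int
    logits: NdArray  # next-token distribution over the vocabulary, e.g. shape (vocab_size,)
```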

## Service initialization
```{include} ../../README.md
:start-after: <!-- start llm-streaming-init -->
:end-before: <!-- end llm-streaming-init -->
```

## Implement the streaming endpoint

```{include} ../../README.md
:start-after: <!-- start llm-streaming-endpoint -->
:end-before: <!-- end llm-streaming-endpoint -->
```

## Serve and send requests
```{include} ../../README.md
:start-after: <!-- start llm-streaming-serve -->
:end-before: <!-- end llm-streaming-serve -->
```
