Added limitations sections + some clarifications
oandreeva-nv committed Oct 25, 2024
1 parent dd4de13 commit 0ec9015
Showing 1 changed file with 38 additions and 16 deletions:
Conceptual_Guide/Part_8-semantic_caching/README.md
@@ -71,9 +71,11 @@ This approach offers several benefits including, but not limited to:

+ **Scalability**

- The improved resource efficiency and reduced computational demands allows
- applications to serve more users without a proportional increase in
- infrastructure costs.
+ As the user base and the volume of queries grow, the probability of cache
+ hits increases, provided that there is adequate storage and resources
+ available to support this scaling. The improved resource efficiency and
+ reduced computational demands allow applications to serve more users
+ without a proportional increase in infrastructure costs.

+ **Consistency in Responses**

@@ -130,22 +132,34 @@ provide the necessary codebase to implement our semantic caching example.

```bash
git clone https://github.com/triton-inference-server/vllm_backend.git
cd vllm_backend
```

- With the repository cloned, the next step is to add the
- [semantic_caching.py.](./artifacts/semantic_caching.py) script to
- the appropriate directory. This script contains the logic for our semantic
- caching implementation.
+ With the repository successfully cloned, the next step is to apply all
+ necessary modifications. To simplify this process, we've prepared a
+ [semantic_cache.patch](./artifacts/semantic_cache.patch)
+ that consolidates all changes into a single step:

```bash
-wget -P vllm_backend/src/utils/ https://raw.githubusercontent.com/triton-inference-server/tutorials/refs/heads/main/Conceptual_Guide/Part_8-semantic_caching/artifacts/semantic_caching.py
+curl https://raw.githubusercontent.com/triton-inference-server/tutorials/refs/heads/main/Conceptual_Guide/Part_8-semantic_caching/artifacts/semantic_cache.patch | git apply -v
```

- Now that we have added the semantic caching script, let's proceed by making
- some adjustments in `vllm_backend/src/model.py`. These changes will integrate
- the semantic caching functionality into the model.
+ If you're eager to start using Triton with the optimized vLLM backend,
+ you can skip ahead to the
+ [Launching Triton with Optimized vLLM Backend](#launching-triton-with-optimized-vllm-backend)
+ section. However, for those interested in understanding the specifics,
+ let's explore what this patch includes.

- First, ensure that you import the necessary classes from `semantic_caching.py`:
+ The patch introduces a new script,
+ [semantic_caching.py](./artifacts/semantic_caching.py), which is added to the
+ appropriate directory. This script implements the core logic for our
+ semantic caching functionality.

+ Next, the patch integrates semantic caching into the model. Let's walk through
+ these changes step by step.

+ First, it imports the necessary classes from
+ [semantic_caching.py](./artifacts/semantic_caching.py) into the codebase:

```diff
...
from utils.metrics import VllmStatLogger
+from utils.semantic_caching import SemanticCPUCacheConfig, SemanticCPUCache
```
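
To make the walkthrough more concrete, here is a highly simplified sketch of what a semantic cache of this kind can look like. It is not the actual `semantic_caching.py`: the embedding model, similarity threshold, and `get`/`set` method names below are illustrative assumptions.

```python
# Simplified, self-contained illustration of a semantic cache.
# Not the actual semantic_caching.py: the encoder, threshold, and method
# names are assumptions made for this sketch.
from dataclasses import dataclass, field
from typing import Any, List, Optional

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class SemanticCPUCacheConfig:
    # Illustrative defaults: any sentence-embedding model and a cosine
    # similarity threshold above which two prompts count as "the same".
    encoder: Any = field(
        default_factory=lambda: SentenceTransformer("all-MiniLM-L6-v2")
    )
    threshold: float = 0.9


class SemanticCPUCache:
    def __init__(self, config: SemanticCPUCacheConfig):
        self.encoder = config.encoder
        self.threshold = config.threshold
        self.embeddings: List[np.ndarray] = []
        self.responses: List[Any] = []

    def _embed(self, prompt: str) -> np.ndarray:
        # Normalize so that a dot product equals cosine similarity.
        vec = np.asarray(self.encoder.encode(prompt), dtype=np.float32)
        return vec / np.linalg.norm(vec)

    def get(self, prompt: str) -> Optional[Any]:
        """Return a cached response for a semantically similar prompt, if any."""
        if not self.embeddings:
            return None
        query = self._embed(prompt)
        scores = np.stack(self.embeddings) @ query
        best = int(np.argmax(scores))
        if scores[best] >= self.threshold:
            return self.responses[best]
        return None

    def set(self, prompt: str, response: Any) -> None:
        """Remember this prompt's embedding and the response generated for it."""
        self.embeddings.append(self._embed(prompt))
        self.responses.append(response)
```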

- Next, initialize the semantic cache during the initialization step.
+ Next, it sets up the semantic cache during the initialization step.
This setup will prepare your model to utilize semantic caching during
its operations.

@@ -177,9 +191,9 @@ its operations.

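Conceptually, this initialization boils down to constructing the cache once when the model loads, roughly as in the following sketch (the default-constructed config is an assumption; the actual patch may wire things up differently):

```python
# Rough sketch of the initialization step; the actual patch may configure
# the cache differently (e.g., with non-default config values).
from utils.semantic_caching import SemanticCPUCacheConfig, SemanticCPUCache

class TritonPythonModel:
    def initialize(self, args):
        ...
        # Build the cache once per model instance so that cached responses
        # persist across requests served by this model.
        config = SemanticCPUCacheConfig()
        self.semantic_cache = SemanticCPUCache(config=config)
```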

- Finally, we'll add logic to query and update the semantic cache during
- request processing. This ensures that cached responses are efficiently utilized
- whenever possible.
+ Finally, the patch incorporates logic to query and update the semantic cache
+ during request processing. This ensures that cached responses are efficiently
+ utilized whenever possible.
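
At a high level, the resulting request flow looks like the sketch below. This is an illustration only; `extract_prompt` and `run_vllm_engine` are hypothetical placeholders rather than helpers from the real `model.py`.

```python
# Sketch of the cache-aware request flow; extract_prompt and run_vllm_engine
# are hypothetical placeholders, not helpers from the real model.py.
class TritonPythonModel:
    async def generate(self, request):
        prompt = extract_prompt(request)

        # Semantic lookup first: on a hit we can respond without running vLLM.
        cached = self.semantic_cache.get(prompt)
        if cached is not None:
            return cached

        # Cache miss: run normal vLLM generation, then remember the result so
        # that semantically similar prompts can reuse it later.
        response = await run_vllm_engine(request)
        self.semantic_cache.set(prompt, response)
        return response
```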

```diff
async def generate(self, request):
```

@@ -315,6 +329,14 @@ Clearly, the latter 2 requests are semantically similar to the first one, which
resulted in a cache hit scenario, reducing the latency of our model from
approximately 1.1s to an average of 0.048s per request.

## Current Limitations

* The current implementation of the semantic cache considers only the prompt
itself when evaluating cache hits, without accounting for additional request
parameters such as `max_tokens` and `temperature`. As a result, a cached
response may be returned even when a request uses a different generation
configuration, which can affect the accuracy of cached responses. One possible
way to account for such parameters is sketched below.
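
Purely as an illustration, and not as part of the reference implementation, request parameters could be stored alongside each cached response and a semantic match treated as a hit only when those parameters also agree:

```python
# Hypothetical extension, not part of the reference implementation: keep the
# sampling parameters next to each cached response and only reuse a response
# when those parameters match the incoming request.
def store(cache, prompt, response, max_tokens, temperature):
    cache.set(prompt, (response, (max_tokens, temperature)))

def lookup(cache, prompt, max_tokens, temperature):
    entry = cache.get(prompt)  # semantic match on the prompt only
    if entry is None:
        return None
    response, params = entry
    if params != (max_tokens, temperature):
        # Same meaning, different generation settings: treat it as a miss.
        return None
    return response
```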

## Interested in This Feature?

While this reference implementation provides a glimpse into the potential
