
docs: Add Semantic Caching Tutorial #118

Merged
merged 15 commits into main from oandreeva_semantic_caching_poc on Oct 26, 2024
Conversation

@oandreeva-nv (Collaborator) commented Oct 23, 2024

This PR adds a reference implementation of a local semantic caching mechanism.
Note:
Adding a CPU-based index, since for the current tutorial this seems sufficient. A GPU-based index makes more sense when we have a large number of vectors to process.
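The mechanism can be illustrated with a small sketch (not the PR's actual code): cache responses keyed by query embeddings, and serve a cached response when a new query's embedding is close enough to a stored one. A toy bag-of-words embedding and a brute-force cosine search stand in here for the real embedding model and CPU vector index used in the tutorial.

```python
# Minimal sketch of a semantic cache. Queries whose embeddings are close
# enough to a previously cached query reuse the cached response instead of
# re-running the model. toy_embed() is a stand-in for a real sentence
# embedding model; the linear scan stands in for a CPU vector index.
import math
from typing import Optional


def toy_embed(text: str) -> dict:
    # Bag-of-words counts as a toy "embedding" (illustration only).
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def cosine(a: dict, b: dict) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, query: str) -> Optional[str]:
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip model inference entirely
        return None

    def insert(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


cache = SemanticCache(threshold=0.8)
cache.insert("what is the capital of France", "Paris")
print(cache.lookup("what is the capital of France ?"))  # near-duplicate -> Paris
print(cache.lookup("how do I bake bread"))              # dissimilar -> None
```

A production version would replace `toy_embed` with a sentence-embedding model and the linear scan with a vector index; the hit/miss logic stays the same.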

Open items:

I've added a section called "Interested in This Feature?", which is not finished at the moment.

My suggestion for community engagement is to create a GitHub issue and encourage readers and users to vote on it if they're interested.

If there are no objections, I'll proceed with this idea.

[Edit 1] discussion opened -> triton-inference-server/server#7742

supported feature in Triton Inference Server.

We value your input! If you're interested in seeing semantic caching as a
supported feature in future releases, we encourage you to [FILL IN]
@rmccorm4 (Collaborator) Oct 23, 2024
FILL IN

reminder for self

Comment on lines +313 to +315
Clearly, the latter two requests are semantically similar to the first one,
which resulted in cache hits, reducing the latency of our model from
approx. 1.1 s to an average of 0.048 s per request.
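Numbers like those quoted above can be gathered with a simple timing harness. This sketch is not the tutorial's client code: `slow_infer` is a stand-in for real model inference, and an exact-match dict stands in for the semantic cache (a real semantic cache matches on embedding similarity instead).

```python
# Hedged sketch: measuring per-request latency with and without a cache hit.
# slow_infer() simulates model inference cost; the dict simulates the cache.
import time


def slow_infer(query: str) -> str:
    time.sleep(0.05)  # pretend inference takes ~50 ms
    return f"answer to: {query}"


cache = {}


def timed_request(query: str):
    start = time.perf_counter()
    if query in cache:
        answer = cache[query]       # cache hit: no inference cost
    else:
        answer = slow_infer(query)  # cache miss: pay full inference cost
        cache[query] = answer
    return answer, time.perf_counter() - start


_, miss_latency = timed_request("what is Triton?")
_, hit_latency = timed_request("what is Triton?")
print(f"miss: {miss_latency:.3f}s, hit: {hit_latency:.3f}s")
```

The same harness, pointed at a real endpoint, would also expose the cache-miss overhead the reviewer asks about below (embedding plus index lookup on top of inference).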
Collaborator

Did you have a rough idea of the cache-miss cost? e.g., one request without semantic caching vs. one request with semantic caching? Just curious about the rough magnitude of the overhead.

Collaborator Author

I can probably do some estimations for that.

Collaborator Author

I'm leaving this unresolved to remind myself to do this study as a follow-up.

@rmccorm4 (Collaborator) left a comment

This looks great! It was very enjoyable to read 🤓

@krishung5 (Contributor)

Nice tutorial! Really fun to try it out 🚀

GuanLuo previously approved these changes Oct 23, 2024
krishung5 previously approved these changes Oct 23, 2024
@oandreeva-nv dismissed stale reviews from krishung5 and GuanLuo via dd4de13 October 24, 2024 19:11
rmccorm4 previously approved these changes Oct 25, 2024
@rmccorm4 (Collaborator) left a comment

LGTM!

Left some nits and suggested a PR title change - feel free to change

@rmccorm4 changed the title from "Semantic Caching tutorial" to "docs: Add Semantic Caching Tutorial" Oct 25, 2024
@oandreeva-nv merged commit c52dcc1 into main Oct 26, 2024
3 checks passed
@oandreeva-nv deleted the oandreeva_semantic_caching_poc branch October 26, 2024 00:20
4 participants