
Added documentation of using warmups to initialize LoRA weights #515

Open · wants to merge 1 commit into base: main
Conversation

TheCodeWrangler

This PR adds documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup request that can be used with the triton-inference-server TensorRT-LLM backend.

With this approach, the client of the Triton Inference Server backend never needs to supply the LoRA weights, and the weights do not have to be loaded or passed through any of the Python backend models (e.g. preprocessing). That avoids the NumPy datatype conversion step entirely, which matters because NumPy does not support bfloat16.
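To illustrate the datatype point: NumPy has no bfloat16 dtype, but a warmup request's raw data file only needs the raw bytes, which can be produced from float32 by bit truncation. The sketch below is a minimal, illustrative example under stated assumptions; the `pack_lora_row` layout and the `warmup/lora_weights` path are hypothetical stand-ins, not the backend's documented format, so check the TensorRT-LLM backend docs for the exact tensor layout.

```python
import numpy as np

def to_bfloat16_bytes(arr):
    """Produce raw bfloat16 bytes from a float32 array by truncating the
    low 16 mantissa bits (round-toward-zero). NumPy has no bfloat16 dtype,
    so we work at the bit level via a uint32 view."""
    u32 = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
    return (u32 >> 16).astype(np.uint16).tobytes()

def pack_lora_row(lora_a, lora_b, row_width):
    """Concatenate the flattened A and B adapter matrices into one row of
    a fixed-width weights tensor, zero-padded on the right.
    (Illustrative layout only; not the backend's documented format.)"""
    flat = np.concatenate([lora_a.ravel(), lora_b.ravel()]).astype(np.float32)
    assert flat.size <= row_width, "adapter too large for row_width"
    row = np.zeros(row_width, dtype=np.float32)
    row[: flat.size] = flat
    return row

# Toy example: a rank-8 adapter on a 16->16 projection.
rng = np.random.default_rng(0)
lora_a = rng.standard_normal((8, 16))    # [rank, in_dim]
lora_b = rng.standard_normal((16, 8))    # [out_dim, rank]
row = pack_lora_row(lora_a, lora_b, row_width=300)
raw = to_bfloat16_bytes(row)             # raw bytes for a warmup data file
# e.g. open("warmup/lora_weights", "wb").write(raw)  # hypothetical path
```

Because the warmup file carries opaque bytes, the bfloat16 weights never have to round-trip through a NumPy dtype on the client or in a Python backend model.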

@smehta2000

Tagging @kaiyux @byshiue to help triage and/or add to review board, thanks!

@TheCodeWrangler
Author

Curious to get any feedback here.

This update is also related to a performance issue I am seeing.
NVIDIA/TensorRT-LLM#1957

This PR gets results much closer to the expected outputs, but not fully in line with the Hugging Face / pre-compiled results. I would love feedback on the process for preparing the adapter weights.
