Constrained Decoding with Python backend and BLS #7778
Hi @MatteoPagliani, thanks for raising this issue! I think your observations are correct, and you may need a little extra logic to ensure that the new BLS inputs are properly propagated to the correct model, compared to the ensemble approach. In this case, I think it may require adding the new input to this name_map. If this doesn't work, you can probably trace through the code to examine the tensors being put into the request object at a few points in time, such as here or here. Otherwise, you can reach out in https://github.com/triton-inference-server/tensorrtllm_backend/issues for further help if it's not intuitive to add a new BLS input. Hope this helps!
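To illustrate the kind of change being suggested, here is a minimal sketch of adding the new input to that mapping. The dict contents and file path below are illustrative assumptions; the real `name_map` lives in the BLS model's Python code and its exact entries depend on your TRT-LLM backend version:

```python
# Illustrative only: the real mapping lives in the TRT-LLM BLS model's
# Python code (e.g. tensorrt_llm_bls/1/lib/triton_decoder.py) and its exact
# keys differ by backend version. The point is just that the new tensor name
# must be added so the BLS model forwards it to the tensorrt_llm model.
name_map = {
    # ... existing request-field -> tensor-name entries ...
    "logits_post_processor_name": "logits_post_processor_name",
}
```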
Hi @rmccorm4 and thanks a lot for your reply! I managed to make it work by adding some snippets of code, including the change to the name_map that you mentioned. However, this is not ideal, since the Triton Model Repository in which we need to change the code is usually "generated" dynamically using fill_template.py, so we would have to re-apply the modifications every time we generate a new Triton Model Repository.

Do you have any plans to integrate constrained decoding / function calling capabilities natively in Triton Inference Server / the TensorRT-LLM backend? I can also see that the new OpenAI-API-compatible frontend is out, but it doesn't seem to support function/tool calling yet. Thanks!
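One way to make that manual step repeatable would be to script the extra edit so it can be re-run after each fill_template.py invocation. A rough sketch of the idea; the file path and the exact input block are assumptions based on this thread, not an official workflow:

```python
# patch_bls_config.py -- hypothetical helper, re-run after each
# fill_template.py invocation so the extra BLS input is not lost when the
# model repository is regenerated.
from pathlib import Path

# Assumed input block; type/dims/optional mirror what the tutorial adds for
# the ensemble config, so verify against your own config.pbtxt.
EXTRA_INPUT = """
input [
  {
    name: "logits_post_processor_name"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  }
]
"""

def patch_config(config_path: str) -> None:
    path = Path(config_path)
    text = path.read_text()
    if "logits_post_processor_name" in text:
        return  # already patched; keeps the script idempotent
    # Protobuf text format allows a repeated field such as `input` to appear
    # more than once, so appending a second block is valid.
    path.write_text(text + EXTRA_INPUT)

if __name__ == "__main__":
    # Hypothetical repository layout; adjust to where fill_template.py writes.
    patch_config("triton_model_repo/tensorrt_llm_bls/config.pbtxt")
```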
Hi @MatteoPagliani, glad you got it to work!
This would be a good feature request to raise in https://github.com/triton-inference-server/tensorrtllm_backend/issues; that's the best place to ask about getting constrained decoding out of the box with the TRT-LLM backend.
You are correct that function/tool calling is not currently supported in that OpenAI-compatible frontend implementation. It is on our radar to eventually implement, but there's no specific timeline at the moment. We're open to contributions if you're interested in helping build those features!
Hi,
I'm trying to integrate function calling capabilities for TensorRT-LLM models inside Triton Inference Server. I'm following this tutorial. I am interested in implementing the approach relying on external libraries, not the one that uses prompt engineering. In particular, I am planning to use outlines and grammars to guide generation.
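For context, the behaviour I'm after is the usual logits-post-processor idea: at each decoding step, mask out every token the grammar does not allow before sampling. A conceptual sketch, where the function name and the way the allowed-token set would be obtained from outlines are illustrative and not the tutorial's actual code:

```python
# Conceptual sketch of constrained decoding via a logits post-processor:
# tokens that the grammar/FSM does not allow are pushed to -inf before
# sampling. The real callback signature and registration are defined by the
# TRT-LLM backend and the tutorial's model.py, not by this snippet.
import torch

def mask_disallowed_tokens(logits: torch.Tensor,
                           allowed_token_ids: list[int]) -> torch.Tensor:
    """Return logits where every token outside `allowed_token_ids` is -inf."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0  # allowed tokens keep their score
    return logits + mask
```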
I followed all the steps detailed in the tutorial linked above. I made the recommended modifications to the `model.py` files and added `logits_post_processor_name` in `tensorrt_llm/config.pbtxt`. However, the tutorial uses ensemble models rather than BLS, so it suggests updating `ensemble/config.pbtxt` (so that the ensemble model properly passes the additional input parameter to the tensorrt-llm model) by adding the corresponding input entries.

In order to use BLS instead of ensemble, I thought it was sufficient to add `logits_post_processor_name` to `tensorrt_llm_bls/config.pbtxt` as I did for `tensorrt_llm/config.pbtxt`. However, this is not the case. The problem I'm facing is that the `request` provided to `tensorrt_llm/1/model.py` does not contain the field `logits_post_processor_name`. I checked the payload of `request` inside the main loop `for request in requests:` of the `execute` function (a minimal version of that check is sketched below).

Should I change other snippets of Python code to enable the use of constrained decoding with BLS? Could you please provide guidance on how the BLS approach should be handled with respect to the ensemble approach? Thanks in advance for any help!
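Here is, roughly, the check I am running inside `execute()`, trimmed to the relevant part (the real `model.py` obviously contains much more):

```python
# Debugging sketch for tensorrt_llm/1/model.py: log whether the extra input
# actually reaches this model. pb_utils.get_input_tensor_by_name() returns
# None when the tensor was not propagated by the upstream BLS model.
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            lpp = pb_utils.get_input_tensor_by_name(
                request, "logits_post_processor_name")
            pb_utils.Logger.log_info(
                "logits_post_processor_name = "
                + ("<missing>" if lpp is None else str(lpp.as_numpy())))
            # ... the model's real execute() logic continues here ...
        return responses
```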
I'm tagging @oandreeva-nv since she contributed to the tutorial about Constrained Decoding and she has already been really helpful in the past.