
Constrained Decoding with Python backend and BLS #7778

Open
MatteoPagliani opened this issue Nov 8, 2024 · 3 comments
Labels
module: backends Issues related to the backends

Comments

@MatteoPagliani

MatteoPagliani commented Nov 8, 2024

Hi,

I'm trying to integrate function calling capabilities for TensorRT-LLM models inside Triton Inference Server. I'm following this tutorial. I am interested in implementing the approach relying on external libraries, not the one that uses prompt engineering. In particular, I am planning to use outlines and grammars to guide generation.

I followed all the steps detailed in the tutorial linked above. I made the recommended modifications to the model.py files and added logits_post_processor_name in tensorrt_llm/config.pbtxt.

However, the tutorial uses ensemble models rather than BLS, so it suggests updating ensemble/config.pbtxt (so that the ensemble model properly passes the additional input parameter to the tensorrt_llm model) by adding

+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }

and

+     input_map {
+       key: "logits_post_processor_name"
+       value: "logits_post_processor_name"
+     }

To use BLS instead of the ensemble, I thought it would be sufficient to add logits_post_processor_name to tensorrt_llm_bls/config.pbtxt, as I did for tensorrt_llm/config.pbtxt. However, this is not the case.
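
Concretely, the addition to tensorrt_llm_bls/config.pbtxt would mirror the snippet above for the ensemble (same name, type, dims, and optional flag):

+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }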

The problem I'm facing is that the request provided to tensorrt_llm/1/model.py does not contain the logits_post_processor_name field. I checked the payload of each request inside the main loop (for request in requests:) of the execute function.

Do I need to change other snippets of Python code to enable constrained decoding with BLS? Could you please provide guidance on how the BLS approach should be handled compared to the ensemble approach? Thanks in advance for any help!
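
For concreteness, here is a minimal sketch of the kind of forwarding I would expect the BLS to do, using the generic Python backend utilities. The tensor names below are placeholders and this is not the actual tensorrt_llm_bls code (which, among other things, goes through the preprocessing model to tokenize before calling tensorrt_llm); it only illustrates how an optional input would be read and passed along.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Gather the tensors to forward to the tensorrt_llm model.
            # "text_input" is only a placeholder for whatever inputs the BLS
            # already forwards; the point is the extra optional tensor.
            forwarded = []
            for name in ["text_input", "logits_post_processor_name"]:
                tensor = pb_utils.get_input_tensor_by_name(request, name)
                # Optional inputs come back as None when the client did not
                # send them, so only forward them when present.
                if tensor is not None:
                    forwarded.append(tensor)

            llm_request = pb_utils.InferenceRequest(
                model_name="tensorrt_llm",
                requested_output_names=["output_ids"],
                inputs=forwarded,
            )
            llm_response = llm_request.exec()
            if llm_response.has_error():
                raise pb_utils.TritonModelException(llm_response.error().message())

            output_ids = pb_utils.get_output_tensor_by_name(llm_response, "output_ids")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_ids]))
        return responses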

I'm tagging @oandreeva-nv since she contributed to the tutorial about Constrained Decoding and she has already been really helpful in the past.

@rmccorm4
Collaborator

Hi @MatteoPagliani, thanks for raising this issue! I think your observations are correct, and you may need a little extra logic to ensure that the new BLS inputs are properly propagated to the correct model, compared to the ensemble, which propagates inputs automatically.

In this case, I think it may require adding the new input to this name_map. If this doesn't work, you can probably trace through the code to examine the tensors being put into the request object at a few points in time, such as here or here. Otherwise, you can reach out in https://github.com/triton-inference-server/tensorrtllm_backend/issues for further help if it's not intuitive to add a new BLS input.
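
As a rough sketch (the exact file and dictionary names depend on the tensorrtllm_backend version, so treat the entries below as hypothetical), extending such a name map usually just means adding one entry that maps the BLS request field to the input tensor name expected by the tensorrt_llm model:

# Hypothetical excerpt from the BLS decoder logic in tensorrt_llm_bls:
# a dict mapping fields of the incoming BLS request to input tensor names
# of the downstream tensorrt_llm model. The surrounding code builds the
# inference request from this mapping, so an input missing here is
# silently dropped.
name_map = {
    "text_input": "text_input",
    "max_tokens": "max_tokens",
    # New entry so the BLS forwards the logits post-processor selection:
    "logits_post_processor_name": "logits_post_processor_name",
}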

Hope this helps!

@rmccorm4 rmccorm4 added the module: backends Issues related to the backends label Nov 14, 2024
@rmccorm4 rmccorm4 self-assigned this Nov 14, 2024
@MatteoPagliani
Author

MatteoPagliani commented Nov 14, 2024

Hi @rmccorm4 and thanks a lot for your reply!

I managed to make it work by adding some snippets of code, including doing the change to the name_map that you mentioned.

However, this is not ideal, since the Triton Model Repository in which we need to change the code is usually "generated" dynamically using fill_template.py, and this would require us to redo the modifications each time we "generate" a new Triton Model Repository.

Do you have any plans to integrate constrained decoding / function calling capabilities natively in Triton Inference Server / TensorRT-LLM backend?

I can also see that the new OpenAI-API-compatible frontend is out, but the tools and tool_choice parameters in the openai_schema are not connected to any logic right now. Are you going to work on this?

Thanks!

@rmccorm4
Collaborator

Hi @MatteoPagliani,

Glad you got it to work!

> However, this is not ideal, since the Triton Model Repository in which we need to change the code is usually "generated" dynamically using fill_template.py, and this would require us to redo the modifications each time we "generate" a new Triton Model Repository.

> Do you have any plans to integrate constrained decoding / function calling capabilities natively in Triton Inference Server / TensorRT-LLM backend?

This would be a good feature request to raise in https://github.com/triton-inference-server/tensorrtllm_backend/issues, as that's the best place to ask about getting constrained decoding out of the box with the TRT-LLM backend.

> I can also see that the new OpenAI-API-compatible frontend is out, but the tools and tool_choice parameters in the openai_schema are not connected to any logic right now. Are you going to work on this?

You are correct that function/tool calling is not currently supported in that OpenAI-compatible frontend implementation. It is on our radar to eventually implement, but there's no specific timeline at the moment. We're open to contributions if you're interested in working on those features!
