
Constrained Decoding with Python backend and BLS #7778

Open
MatteoPagliani opened this issue Nov 8, 2024 · 3 comments
Labels
module: backends Issues related to the backends

Comments

@MatteoPagliani

MatteoPagliani commented Nov 8, 2024

Hi,

I'm trying to integrate function calling capabilities for TensorRT-LLM models inside Triton Inference Server. I'm following this tutorial. I am interested in implementing the approach relying on external libraries, not the one that uses prompt engineering. In particular, I am planning to use outlines and grammars to guide generation.

I followed all the steps detailed in the tutorial linked above. I made the recommended modifications to the model.py files and added logits_post_processor_name in tensorrt_llm/config.pbtxt.

However, the tutorial uses ensemble models rather than BLS, so it suggests updating ensemble/config.pbtxt (so that the ensemble model properly passes the additional input parameter to the tensorrt_llm model) by adding

+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }

and

+     input_map {
+       key: "logits_post_processor_name"
+       value: "logits_post_processor_name"
+     }

To use BLS instead of the ensemble, I thought it would be sufficient to add logits_post_processor_name to tensorrt_llm_bls/config.pbtxt, as I did for tensorrt_llm/config.pbtxt. However, this is not the case.
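
Concretely, the addition to tensorrt_llm_bls/config.pbtxt would mirror the snippet above for the ensemble (same name, type, dims, and optional flag):

+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }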

The problem I'm facing is that the request provided to tensorrt_llm/1/model.py does not contain the logits_post_processor_name field. I checked the payload of each request inside the main loop (for request in requests:) of the execute function.

Do I need to change other snippets of Python code to enable constrained decoding with BLS? Could you please provide guidance on how the BLS approach should be handled compared to the ensemble approach? Thanks in advance for any help!
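
For concreteness, here is a minimal sketch of the kind of forwarding I would expect the BLS to do, using the generic Python backend utilities. The tensor names below are placeholders and this is not the actual tensorrt_llm_bls code (which, among other things, goes through the preprocessing model to tokenize before calling tensorrt_llm); it only illustrates how an optional input would be read and passed along.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Gather the tensors to forward to the tensorrt_llm model.
            # "text_input" is only a placeholder for whatever inputs the BLS
            # already forwards; the point is the extra optional tensor.
            forwarded = []
            for name in ["text_input", "logits_post_processor_name"]:
                tensor = pb_utils.get_input_tensor_by_name(request, name)
                # Optional inputs come back as None when the client did not
                # send them, so only forward them when present.
                if tensor is not None:
                    forwarded.append(tensor)

            llm_request = pb_utils.InferenceRequest(
                model_name="tensorrt_llm",
                requested_output_names=["output_ids"],
                inputs=forwarded,
            )
            llm_response = llm_request.exec()
            if llm_response.has_error():
                raise pb_utils.TritonModelException(llm_response.error().message())

            output_ids = pb_utils.get_output_tensor_by_name(llm_response, "output_ids")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_ids]))
        return responses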

I'm tagging @oandreeva-nv since she contributed to the tutorial about Constrained Decoding and she has already been really helpful in the past.

@rmccorm4
Collaborator

Hi @MatteoPagliani, thanks for raising this issue! I think your observations are correct, and you may need a little extra logic to ensure that the new BLS inputs are properly propagated to the correct model, compared to the ensemble, which propagates inputs automatically.

In this case, I think it may require adding the new input to this name_map. If this doesn't work, you can probably trace through the code to examine the tensors being put into the request object at a few points in time, such as here or here. Otherwise, you can reach out in https://github.com/triton-inference-server/tensorrtllm_backend/issues for further help if it's not intuitive to add a new BLS input.
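
As a rough sketch (the exact file and dictionary names depend on the tensorrtllm_backend version, so treat the entries below as hypothetical), extending such a name map usually just means adding one entry that maps the BLS request field to the input tensor name expected by the tensorrt_llm model:

# Hypothetical excerpt from the BLS decoder logic in tensorrt_llm_bls:
# a dict mapping fields of the incoming BLS request to input tensor names
# of the downstream tensorrt_llm model. The surrounding code builds the
# inference request from this mapping, so an input missing here is
# silently dropped.
name_map = {
    "text_input": "text_input",
    "max_tokens": "max_tokens",
    # New entry so the BLS forwards the logits post-processor selection:
    "logits_post_processor_name": "logits_post_processor_name",
}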

Hope this helps!

@rmccorm4 rmccorm4 added the module: backends Issues related to the backends label Nov 14, 2024
@rmccorm4 rmccorm4 self-assigned this Nov 14, 2024
@MatteoPagliani
Author

MatteoPagliani commented Nov 14, 2024

Hi @rmccorm4 and thanks a lot for your reply!

I managed to make it work by adding some snippets of code, including doing the change to the name_map that you mentioned.

However, this is not ideal, since the Triton Model Repository in which we need to change the code is usually "generated" dynamically using fill_template.py, and this would require us to redo the modifications each time we "generate" a new Triton Model Repository.

Do you have any plans to integrate constrained decoding / function calling capabilities natively in Triton Inference Server / TensorRT-LLM backend?

I can also see that the new OpenAI-API-compatible frontend is out, but the tools and tool_choice parameters in the openai_schema are not connected to any logic right now. Are you going to work on this?

Thanks!

@rmccorm4
Collaborator

Hi @MatteoPagliani,

Glad you got it to work!

> However, this is not ideal, since the Triton Model Repository in which we need to change the code is usually "generated" dynamically using fill_template.py, and this would require us to redo the modifications each time we "generate" a new Triton Model Repository.

> Do you have any plans to integrate constrained decoding / function calling capabilities natively in Triton Inference Server / TensorRT-LLM backend?

This would be a good feature request to raise in https://github.com/triton-inference-server/tensorrtllm_backend/issues, as that's the best place to ask about getting constrained decoding out of the box with the TRT-LLM backend.

> I can also see that the new OpenAI-API-compatible frontend is out, but the tools and tool_choice parameters in the openai_schema are not connected to any logic right now. Are you going to work on this?

You are correct that function/tool calling is not currently supported in that OpenAI-compatible frontend implementation. It is on our radar to eventually implement, but there's no specific timeline at the moment. We're open to contributions if you're interested in working on those features!
