Update on the development branch #2298

DanBlanaru · 2024-10-08T10:32:19Z

DanBlanaru
Oct 8, 2024
Maintainer

Hi,

The TensorRT-LLM team is pleased to announce that we pushed an update to the development branch (and the Triton backend) on October 8, 2024.

This update (#2297) includes:

Features
- ReDrafter beam search logic is updated to match Apple's ReDrafter v1.1.
- Draft-Target speculative decoding can now be done natively with TensorRT-LLM alone. The driver code is located in examples/run.py and the documentation is in examples/draft_target_model/README.md.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
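
For readers new to draft-target speculative decoding, the idea can be sketched in plain Python. This is only an illustration of the general greedy variant, not the code in examples/run.py and not a TensorRT-LLM API. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token prefix to the next token. A real implementation verifies all draft tokens in a single forward pass of the target model and may use sampling-based acceptance instead of exact matching.

```python
from typing import Callable, List

# A "model" here is any function that maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy draft-target decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed position (a real engine does
        #    this for all positions in one forward pass).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model.
    return (prefix[-1] * 7 + 3) % 50


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small model: wrong on every third position.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1, 2], max_new_tokens=10))
```

Because every emitted token is either confirmed or supplied by the target, the greedy variant produces the same sequence as decoding with the target alone; the draft only changes how many target steps are needed.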
API
- Added logits processor support to the ModelRunnerCpp class.
- Added the isParticipant method to the C++ Executor API to check whether the current process is a participant in the executor instance.
- [BREAKING CHANGE] Removed builder_opt from build_config and the trtllm-build command.
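
As background on the logits processor item, a logits processor is a callback that adjusts the logits of a decoding step before the next token is chosen. The sketch below shows the concept in plain Python only. The callback signature and the names `ban_tokens_processor` and `greedy_pick` are assumptions made for illustration; they are not the ModelRunnerCpp interface, so please check the class itself for the actual arguments.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical callback shape: (logits for this step, tokens generated so far) -> new logits.
LogitsProcessor = Callable[[List[float], List[int]], List[float]]


def ban_tokens_processor(banned: Sequence[int]) -> LogitsProcessor:
    """Build a processor that makes the given token ids impossible to select."""
    banned_set = set(banned)

    def process(logits: List[float], generated: List[int]) -> List[float]:
        # Setting a logit to -inf removes that token from greedy selection and sampling.
        return [-math.inf if i in banned_set else value
                for i, value in enumerate(logits)]

    return process


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    step_logits = [0.1, 2.5, 1.7]
    processor = ban_tokens_processor([1])
    print(greedy_pick(step_logits), greedy_pick(processor(step_logits, [])))
```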
Bug fixes
- Fixed an issue where fp32 vision engines were built for the multimodal example even when the input ONNX files were fp16. TensorRT 10 changed the default to strongly_typed=True; the example now explicitly specifies strongly_typed=False to build the fp16 vision engine.
- Fixed an issue with SmoothQuant calibration with custom datasets. Many thanks to @Bhuvanesh09 for the contribution in "fix: add support for passing calib sequence length, and num samples + fixing use of custom calibration dataset for smoothquant in llama" (#2243).
- Fixed an issue with trtllm-build --fast-build when using fake or random weights. Thanks to @ZJLi2013 for flagging it in "trtllm-build with --fast-build ignore transformer layers" (#2135).
- Fixed an accuracy issue in speculative decoding. The internal handling of speculative decoding logits was also changed to be similar to HuggingFace's assistant_model.
Performance
- Improved customAllReduce performance by using Lamport-style AllReduce + Norm fusion.
- Static input tensors are now set once at the beginning instead of on each iteration. (This should be especially noticeable for RNN-based models because the RNN state pointers are currently separate for each layer.)
- The draft model can now trigger a device memcpy over MPI to the target model's process in orchestrator mode. This reduces the latency between the end of draft model generation and the beginning of target inference.
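
To make the AllReduce + Norm fusion item more concrete, here is a small numerical sketch in plain Python. It assumes the norm is an RMSNorm, which the note above does not specify, and it models only the arithmetic: summing partial results across ranks and normalizing in the same pass. It does not model the Lamport-style synchronization or any GPU detail, and all function names are hypothetical.

```python
import math
from typing import List


def all_reduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Element-wise sum of the partial vectors held by each rank."""
    return [sum(values) for values in zip(*per_rank)]


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm applied as a separate step after the reduction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def fused_all_reduce_rms_norm(per_rank: List[List[float]], weight: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Sum across ranks and accumulate the squared norm in the same pass."""
    reduced: List[float] = []
    sum_sq = 0.0
    for values in zip(*per_rank):
        s = sum(values)
        reduced.append(s)
        sum_sq += s * s  # gathered while the reduced value is still at hand
    inv_rms = 1.0 / math.sqrt(sum_sq / len(reduced) + eps)
    return [w * v * inv_rms for v, w in zip(reduced, weight)]


if __name__ == "__main__":
    partials = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]  # two ranks, hidden size 3
    gamma = [1.0, 1.0, 1.0]
    print(fused_all_reduce_rms_norm(partials, gamma))
```

The fused version gives the same values as reducing first and normalizing afterwards; the benefit in a real kernel comes from avoiding an extra pass over memory and an extra kernel launch.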

Thanks,
The TensorRT-LLM Engineering Team
