v0.1.0b7
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are supported on `from_pretrained` and `pipeline` through the use of `tp=<int>`, `pp=<int>`
- Models from `transformers` are now loaded in their respective checkpoint data type rather than `float32`, avoiding most of the memory errors that were happening in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two different folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
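Per the notes above, the degrees are passed as keyword arguments, e.g. `from_pretrained(model_id, tp=2, pp=2)`, and together they define a world size of `tp * pp` GPUs: pipeline parallelism assigns contiguous blocks of layers to `pp` stages, while tensor parallelism shards each layer's weights across `tp` ranks within a stage. A minimal sketch of that rank layout (plain Python, independent of the library; the `rank_layout` helper is hypothetical and for illustration only):

```python
def rank_layout(num_layers: int, tp: int, pp: int):
    """Map each GPU rank to its pipeline stage, tensor-parallel shard,
    and the contiguous layer range that stage owns.

    Illustrative only: mirrors the usual TP x PP decomposition where
    world_size = tp * pp and num_layers divides evenly across stages.
    """
    world_size = tp * pp
    layers_per_stage = num_layers // pp
    layout = {}
    for rank in range(world_size):
        # Ranks are grouped by pipeline stage; within a stage they
        # differ only in which tensor-parallel shard they hold.
        pp_rank, tp_rank = divmod(rank, tp)
        start = pp_rank * layers_per_stage
        layout[rank] = {
            "pp_rank": pp_rank,
            "tp_rank": tp_rank,
            "layers": range(start, start + layers_per_stage),
        }
    return layout

# e.g. a 32-layer model on 4 GPUs with tp=2, pp=2:
for rank, info in rank_layout(32, tp=2, pp=2).items():
    print(rank, info)
```

This is why Mixtral support is tied to a multi-GPU setup: a single `tp=1, pp=1` rank would have to hold every expert's weights at once.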
What's Changed
- Fix checking output limits for #114 by @zaycev in #115
- Test batched causallm inference by @fxmarty in #117
- Remove claim of Turing support by @laikhtewari in #118
- Mention important additional parameters for engine config in README by @zaycev in #113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in #124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in #125
- Mixtral by @mfuntowicz in #131
Full Changelog: v0.1.0b6...v0.1.0b7