v0.1.0b7
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are supported on `from_pretrained` and `pipeline` through the use of `tp=<int>`, `pp=<int>`
- Models from `transformers` are now loaded in their respective checkpoint data type rather than `float32`, avoiding most of the memory errors that were happening in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two different folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
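Per the notes above, the degrees are passed as keyword arguments, e.g. `from_pretrained(model_id, tp=2, pp=2)`, and together they define a world size of `tp * pp` GPUs: pipeline parallelism assigns contiguous blocks of layers to `pp` stages, while tensor parallelism shards each layer's weights across `tp` ranks within a stage. A minimal sketch of that rank layout (plain Python, independent of the library; the `rank_layout` helper is hypothetical and for illustration only):

```python
def rank_layout(num_layers: int, tp: int, pp: int):
    """Map each GPU rank to its pipeline stage, tensor-parallel shard,
    and the contiguous layer range that stage owns.

    Illustrative only: mirrors the usual TP x PP decomposition where
    world_size = tp * pp and num_layers divides evenly across stages.
    """
    world_size = tp * pp
    layers_per_stage = num_layers // pp
    layout = {}
    for rank in range(world_size):
        # Ranks are grouped by pipeline stage; within a stage they
        # differ only in which tensor-parallel shard they hold.
        pp_rank, tp_rank = divmod(rank, tp)
        start = pp_rank * layers_per_stage
        layout[rank] = {
            "pp_rank": pp_rank,
            "tp_rank": tp_rank,
            "layers": range(start, start + layers_per_stage),
        }
    return layout

# e.g. a 32-layer model on 4 GPUs with tp=2, pp=2:
for rank, info in rank_layout(32, tp=2, pp=2).items():
    print(rank, info)
```

This is why Mixtral support is tied to a multi-GPU setup: a single `tp=1, pp=1` rank would have to hold every expert's weights at once.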
What's Changed
- Fix checking output limits for #114 by @zaycev in #115
- Test batched causallm inference by @fxmarty in #117
- Remove claim of Turing support by @laikhtewari in #118
- Mention important additional parameters for engine config in README by @zaycev in #113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in #124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in #125
- Mixtral by @mfuntowicz in #131
Full Changelog: v0.1.0b6...v0.1.0b7