DJLServing v0.26.0 Release
Key Changes
- TensorRT-LLM 0.7.1 upgrade, including support for the Mixtral 8x7B MoE model
- Optimum Neuron support
- Transformers-NeuronX 2.16 upgrade, including support for continuous batching
- llama.cpp support
- Extensive documentation updates, including refreshed model deployment configurations
- Refactored configuration management across the different backends
- CUDA 12.1 support for the DeepSpeed and TensorRT-LLM containers
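Several of these features (continuous batching, tensor parallelism, backend selection) are driven by a `serving.properties` file placed alongside the model. The fragment below is a minimal sketch of such a config for the TensorRT-LLM backend; the model ID and batch size are illustrative values, not settings verified against this release:

```properties
# serving.properties — illustrative sketch of an LMI deployment config
# Use the MPI engine for multi-GPU tensor-parallel inference
engine=MPI
# Hugging Face model ID (example value)
option.model_id=mistralai/Mixtral-8x7B-v0.1
# Number of GPUs to shard the model across
option.tensor_parallel_degree=8
# Enable continuous (rolling) batching via the TensorRT-LLM backend
option.rolling_batch=trtllm
# Upper bound on concurrent requests in a rolling batch (example value)
option.max_rolling_batch_size=32
```

See the tuning guides and LMI configuration README linked in the Documentation Updates section below for the authoritative list of properties per backend.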
Enhancements
- [UX][RollingBatch] add details function to the rolling batch by @lanking520 in #1353
- [TRTLLM][UX] add trtllm changes to support stop reason and also log prob by @lanking520 in #1355
- [Docker] upgrade cuda 12.1 support for DJLServing by @lanking520 in #1370
- [feat] optimum handler creation by @tosterberg in #1362
- [python] Update lmi_dist warmup logic by @xyang16 in #1367
- [RollingBatch] optimize rolling batch result by @lanking520 in #1372
- [python] Sets mpi_model property for python to consume by @frankfliu in #1360
- [vLLM] add load_format to support for mixtral model by @lanking520 in #1391
- [python] Sets rolling batch threads as daemon thread by @frankfliu in #1371
- [awscurl] Adds awscurl to repo by @frankfliu in #1408
- Add config passing in lmi-dist by @xyang16 in #1382
- Upgrade flash attention to 2.3.0 by @xyang16 in #1402
- [TRTLLM] Bump up trtllm to version 0.7.1 by @ydm-amazon in #1452
- [tnx] add gqa to properties by @siddvenk in #1478
- [TRTLLM] add enable kv cache reuse by @lanking520 in #1460
- [serving] Adds llama.cpp support by @frankfliu in #1464
- [serving] Allows plugin to override default HTTP handler by @frankfliu in #1424
- [wlm] enable max workers env var for MPI mode by @frankfliu in #1438
- Support AWQ quantization in LMI Dist by @xyang16 in #1435
- [python] Excludes test code from jar by @frankfliu in #1449
- [Refactor][UX] Refactoring vllm rolling batch properties by @sindhuvahinis in #1369
- [DLC][TNX] inf2 stable diffusion handler refactor by @tosterberg in #1393
- [Refactor] lmi dist rolling batch properties by @sindhuvahinis in #1409
- [Refactor] scheduler rolling batch refactor by @sindhuvahinis in #1411
- [awscurl] Allows search nested json key by @frankfliu in #1453
- make jsonline outputs generated tokens by @lanking520 in #1454
- [serving] Loads model zoo and engine from deps folder on startup by @frankfliu in #1457
- [RollingBatch] add customized rollingbatch by @lanking520 in #1468
Bug Fixes
- Fix rolling batch properties by @xyang16 in #1326
- [fix] tnx quantization and docs by @tosterberg in #1332
- [fix] Context length estimate datatype by @sindhuvahinis in #1350
- [UX][CI] fix a few bugs by @lanking520 in #1357
- [fix] inf2 container freeze compiler versions by @tosterberg in #1389
- [Fix] fix the lmi dist device by @sindhuvahinis in #1387
- [vllm] pass hf revision to vllm engine, pin phi2 model revision for test by @siddvenk in #1485
- [python] Fixes mpi_mode properties by @frankfliu in #1368
- [python] Fixes mpi_mode issues by @frankfliu in #1373
- [wlm] Fixes get maxWorkers bug for python engine by @frankfliu in #1375
- [RollingBatch] fix request id in rolling batch by @lanking520 in #1481
- [TRTLLM] Fix bug in handler by @ydm-amazon in #1459
- Works with manual initialization by @zachgk in #1473
- [TNX] version update to 2.16.0 sdk and continuous batching by @tosterberg in #1437
Documentation Updates
- [doc] Update current properties for TNX handler by @tosterberg in #1322
- [doc] lmi configurations readme by @sindhuvahinis in #1323
- [doc] Placeholder for TrtLLM tutorial and tuning guide by @sindhuvahinis in #1333
- [doc] LMI environment variable instruction by @sindhuvahinis in #1334
- [doc] TransformerNeuronX tuning guide by @sindhuvahinis in #1335
- [doc] TensorRt-Llm tuning guide by @sindhuvahinis in #1339
- [doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1340
- [doc] DeepSpeed tuning guide by @sindhuvahinis in #1342
- [doc] LMI dist tuning guide by @sindhuvahinis in #1341
- [doc] seq_scheduler_document by @KexinFeng in #1336
- [doc] large model inference document by @sindhuvahinis in #1343
- [doc] fix docker image uri for trtllm tutorial by @sindhuvahinis in #1348
- [docs] Adds option.max_output_size document by @frankfliu in #1354
- [docs] fix tnx n_positions description by @tosterberg in #1401
- [doc] instruction on adding new properties to default handlers by @sindhuvahinis in #1419
- Add AOT Tutorial by @ydm-amazon in #1338
- [docker] Avoid JVM consume GPU memory by @frankfliu in #1365
- [LMI] DJLServing side placeholder by @lanking520 in #1330
- [Tutorials] add tensorrt llm manual by @lanking520 in #1412
- [TRTLLM] add line in docs for chatglm by @ydm-amazon in #1425
- Update LMI dist tuning guide by @xyang16 in #1428
- [TRTLLM] update the docs and more model support by @lanking520 in #1415
- [TRT-LLM] Update docs for newly added TRT-LLM build args in 0.7.1 by @rohithkrn in #1461
- [TNX][config] update rolling batch batch size behavior and docs by @tosterberg in #1404
- [TRTLLM] Update the docs - add mixtral by @ydm-amazon in #1434
- [TRTLLM] Add gpt model to docs and ci by @ydm-amazon in #1475
CI/CD Updates
- [tnx] version bump to 2.15.2 by @tosterberg in #1363
- [CI][IB] Support variables by @zachgk in #1356
- Bump up DJL version to 0.26.0 by @xyang16 in #1364
- [ci] Fixes nightly gpu integration test by @frankfliu in #1378
- [CI] update the model to fp16 by @lanking520 in #1390
- update models for TRT-LLM 0.6.1 by @rohithkrn in #1392
- [CI][fix] Sagemaker integration test cloudwatch metrics fix by @sindhuvahinis in #1385
- [CI][fix] Inf2 AOT integration test fix by @tosterberg in #1395
- [ci] Fixes flaky async token test by @frankfliu in #1429
- [ci] Fixes merge conflict issue by @frankfliu in #1431
- [ci] Upgrades CI to use JDK 17 by @frankfliu in #1413
- [CI][fix] remove g5xl and introduce rolling batch in lmic by @sindhuvahinis in #1396
- [test] Increases curl timeout for test client by @frankfliu in #1379
- remove invalid test by @lanking520 in #1376
- allow git permission to aws assumed role by @lanking520 in #1377
- [CI] use awscurl for rollingbatch tests by @lanking520 in #1448
- [CI] add trtllm P4D tests by @lanking520 in #1455
- [ci] add information to memory assert in lmi integ tests by @siddvenk in #1487
- [CI] update the memory range by @lanking520 in #1477
- adding build script for lmi-dist dependencies by @lanking520 in #1374
- [LMI-Deps] fix build source by @lanking520 in #1380
- [docker] Updates version to 0.26.0 by @frankfliu in #1384
- [TRTLLM] install ammo dependencies for quantization by @lanking520 in #1394
- Revert "[TRTLLM] install ammo dependencies for quantization" by @lanking520 in #1400
- [DLC][TNX] Freeze all aws-neuronx deps by @tosterberg in #1397
- [TRTLLM][DLC] install ammo with TRTLLM for runtime compilation by @lanking520 in #1403
- [awscurl] Removes commons-io and commons-codec dependency by @frankfliu in #1410
- [TRTLLM][CI] add baichuan and internlm model to the ci for trtllm by @lanking520 in #1416
- [CI] add mistral model tests by @lanking520 in #1417
- [ci] Fixes build issues on aarch64 machine by @frankfliu in #1421
- [ci] Suppress javadoc warnings. by @frankfliu in #1422
- [TRTLLM] [CI] Add ChatGLM to trtllm ci by @ydm-amazon in #1430
- [TRTLLM] add a few more models and tests by @lanking520 in #1432
- [TRTLLM] add qwen model to CI by @lanking520 in #1440
- Remove extra curly brace by @ydm-amazon in #1444
- [TRTLLM] trust remote code for qwen by @lanking520 in #1446
- Install flash attn v1 and v2 in order by @xyang16 in #1451
- [CI] add ci for mixtral and phi2 model by @lanking520 in #1462
- [CI] fix name builder by @lanking520 in #1463
- [CI] add trust remote code to phi by @lanking520 in #1465
- [docker] Add lmi_vllm wheel and environment variables by @xyang16 in #1482
- [docker] Install lmi_vllm wheel by @xyang16 in #1483
- add jinja2 for chat completion use case by @lanking520 in #1484
- [serving] Updates dependencies to latest version by @frankfliu in #1466
- [serving] Adds logs to ModelServerTest. by @frankfliu in #1480
- also build megablocks by default by @lanking520 in #1383
- [TRTLLM] bump up trtllm to 0.7.0 by @lanking520 in #1427
- [CI] add 2 inf2 rolling batch tests by @lanking520 in #1472
- [TNX][CI] setup java for tnx models by @lanking520 in #1476
- [DLC][DeepSpeed] upgrade DeepSpeed container to 12.1 by @lanking520 in #1381
- [docker] Upgrades JDK 17 for docker images by @frankfliu in #1418
- [TRTLLM][CI] reduce memory for GPTJ by @lanking520 in #1439
- Record benchmarks for manual by @zachgk in #1405
- [DLC][TNX] optimum version bump by @tosterberg in #1414
- [CI][IB] Parses template from S3 and local dir by @zachgk in #1406
- [LMIDist] Add test that calls LmiDistRollingBatch with python script by @KexinFeng in #1443
- Remove old flash models in LMI Dist by @xyang16 in #1447
- downgrade pynvml version to be compatible with SM Hosts by @rohithkrn in #1467
- [TRTLLM] build with tp1 by @lanking520 in #1433
- parameterize trtllm version by @rohithkrn in #1441
- remove megablocks from the list by @lanking520 in #1456
- [TNX] llama 70b special param in cc flags by @lanking520 in #1469
- add pydantic validator accepted type by @ydm-amazon in #1479
Full Changelog: v0.25.0...v0.26.0