DJLServing v0.26.0 Release
Key Changes
- TensorRT-LLM 0.7.1 upgrade, including support for the Mixtral 8x7B MoE model
- Optimum Neuron support
- Transformers-NeuronX 2.16 upgrade, including support for continuous batching
- llama.cpp support
- Extensive documentation updates, including refreshed model deployment configurations
- Refactored configuration management across the different backends
- CUDA 12.1 support for the DeepSpeed and TensorRT-LLM containers
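Several of these features (continuous batching, tensor parallelism, backend selection) are driven by a `serving.properties` file placed alongside the model. The fragment below is a minimal sketch of such a config for the TensorRT-LLM backend; the model ID and batch size are illustrative values, not settings verified against this release:

```properties
# serving.properties — illustrative sketch of an LMI deployment config
# Use the MPI engine for multi-GPU tensor-parallel inference
engine=MPI
# Hugging Face model ID (example value)
option.model_id=mistralai/Mixtral-8x7B-v0.1
# Number of GPUs to shard the model across
option.tensor_parallel_degree=8
# Enable continuous (rolling) batching via the TensorRT-LLM backend
option.rolling_batch=trtllm
# Upper bound on concurrent requests in a rolling batch (example value)
option.max_rolling_batch_size=32
```

See the tuning guides and LMI configuration README linked in the Documentation Updates section below for the authoritative list of properties per backend.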
Enhancements
- [UX][RollingBatch] add details function to the rolling batch by @lanking520 in #1353
- [TRTLLM][UX] add trtllm changes to support stop reason and also log prob by @lanking520 in #1355
- [Docker] upgrade cuda 12.1 support for DJLServing by @lanking520 in #1370
- [feat] optimum handler creation by @tosterberg in #1362
- [python] Update lmi_dist warmup logic by @xyang16 in #1367
- [RollingBatch] optimize rolling batch result by @lanking520 in #1372
- [python] Sets mpi_model property for python to consume by @frankfliu in #1360
- [vLLM] add load_format to support for mixtral model by @lanking520 in #1391
- [python] Sets rolling batch threads as daemon thread by @frankfliu in #1371
- [awscurl] Adds awscurl to repo by @frankfliu in #1408
- Add config passing in lmi-dist by @xyang16 in #1382
- Upgrade flash attention to 2.3.0 by @xyang16 in #1402
- [TRTLLM] Bump up trtllm to version 0.7.1 by @ydm-amazon in #1452
- [tnx] add gqa to properties by @siddvenk in #1478
- [TRTLLM] add enable kv cache reuse by @lanking520 in #1460
- [serving] Adds llama.cpp support by @frankfliu in #1464
- [serving] Allows plugin to override default HTTP handler by @frankfliu in #1424
- [wlm] enable max workers env var for MPI mode by @frankfliu in #1438
- Support AWQ quantization in LMI Dist by @xyang16 in #1435
- [python] Excludes test code from jar by @frankfliu in #1449
- [Refactor][UX] Refactoring vllm rolling batch properties by @sindhuvahinis in #1369
- [DLC][TNX] inf2 stable diffusion handler refactor by @tosterberg in #1393
- [Refactor] lmi dist rolling batch properties by @sindhuvahinis in #1409
- [Refactor] scheduler rolling batch refactor by @sindhuvahinis in #1411
- [awscurl] Allows search nested json key by @frankfliu in #1453
- make jsonline outputs generated tokens by @lanking520 in #1454
- [serving] Loads model zoo and engine from deps folder on startup by @frankfliu in #1457
- [RollingBatch] add customized rollingbatch by @lanking520 in #1468
Bug Fixes
- Fix rolling batch properties by @xyang16 in #1326
- [fix] tnx quantization and docs by @tosterberg in #1332
- [fix] Context length estimate datatype by @sindhuvahinis in #1350
- [UX][CI] fix a few bugs by @lanking520 in #1357
- [fix] inf2 container freeze compiler versions by @tosterberg in #1389
- [Fix] fix the lmi dist device by @sindhuvahinis in #1387
- [vllm] pass hf revision to vllm engine, pin phi2 model revision for test by @siddvenk in #1485
- [python] Fixes mpi_mode properties by @frankfliu in #1368
- [python] Fixes mpi_mode issues by @frankfliu in #1373
- [wlm] Fixes get maxWorkers bug for python engine by @frankfliu in #1375
- [RollingBatch] fix request id in rolling batch by @lanking520 in #1481
- [TRTLLM] Fix bug in handler by @ydm-amazon in #1459
- Works with manual initialization by @zachgk in #1473
- [TNX] version update to 2.16.0 sdk and continuous batching by @tosterberg in #1437
Documentation Updates
- [doc] Update current properties for TNX handler by @tosterberg in #1322
- [doc] lmi configurations readme by @sindhuvahinis in #1323
- [doc] Placeholder for TrtLLM tutorial and tuning guide by @sindhuvahinis in #1333
- [doc] LMI environment variable instruction by @sindhuvahinis in #1334
- [doc] TransformerNeuronX tuning guide by @sindhuvahinis in #1335
- [doc] TensorRt-Llm tuning guide by @sindhuvahinis in #1339
- [doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1340
- [doc] DeepSpeed tuning guide by @sindhuvahinis in #1342
- [doc] LMI dist tuning guide by @sindhuvahinis in #1341
- [doc] seq_scheduler_document by @KexinFeng in #1336
- [doc] large model inference document by @sindhuvahinis in #1343
- [doc] fix docker image uri for trtllm tutorial by @sindhuvahinis in #1348
- [docs] Adds option.max_output_size document by @frankfliu in #1354
- [docs] fix tnx n_positions description by @tosterberg in #1401
- [doc] instruction on adding new properties to default handlers by @sindhuvahinis in #1419
- Add AOT Tutorial by @ydm-amazon in #1338
- [docker] Avoid JVM consume GPU memory by @frankfliu in #1365
- [LMI] DJLServing side placeholder by @lanking520 in #1330
- [Tutorials] add tensorrt llm manual by @lanking520 in #1412
- [TRTLLM] add line in docs for chatglm by @ydm-amazon in #1425
- Update LMI dist tuning guide by @xyang16 in #1428
- [TRTLLM] update the docs and more model support by @lanking520 in #1415
- [TRT-LLM] Update docs for newly added TRT-LLM build args in 0.7.1 by @rohithkrn in #1461
- [TNX][config] update rolling batch batch size behavior and docs by @tosterberg in #1404
- [TRTLLM] Update the docs - add mixtral by @ydm-amazon in #1434
- [TRTLLM] Add gpt model to docs and ci by @ydm-amazon in #1475
CI/CD Updates
- [tnx] version bump to 2.15.2 by @tosterberg in #1363
- [CI][IB] Support variables by @zachgk in #1356
- Bump up DJL version to 0.26.0 by @xyang16 in #1364
- [ci] Fixes nightly gpu integration test by @frankfliu in #1378
- [CI] update the model to fp16 by @lanking520 in #1390
- update models for TRT-LLM 0.6.1 by @rohithkrn in #1392
- [CI][fix] Sagemaker integration test cloudwatch metrics fix by @sindhuvahinis in #1385
- [CI][fix] Inf2 AOT integration test fix by @tosterberg in #1395
- [ci] Fixes flaky async token test by @frankfliu in #1429
- [ci] Fixes merge conflict issue by @frankfliu in #1431
- [ci] Upgrades CI to use JDK 17 by @frankfliu in #1413
- [CI][fix] remove g5xl and introduce rolling batch in lmic by @sindhuvahinis in #1396
- [test] Increases curl timeout for test client by @frankfliu in #1379
- remove invalid test by @lanking520 in #1376
- allow git permission to aws assumed role by @lanking520 in #1377
- [CI] use awscurl for rollingbatch tests by @lanking520 in #1448
- [CI] add trtllm P4D tests by @lanking520 in #1455
- [ci] add information to memory assert in lmi integ tests by @siddvenk in #1487
- [CI] update the memory range by @lanking520 in #1477
- adding build script for lmi-dist dependencies by @lanking520 in #1374
- [LMI-Deps] fix build source by @lanking520 in #1380
- [docker] Updates version to 0.26.0 by @frankfliu in #1384
- [TRTLLM] install ammo dependencies for quantization by @lanking520 in #1394
- Revert "[TRTLLM] install ammo dependencies for quantization" by @lanking520 in #1400
- [DLC][TNX] Freeze all aws-neuronx deps by @tosterberg in #1397
- [TRTLLM][DLC] install ammo with TRTLLM for runtime compilation by @lanking520 in #1403
- [awscurl] Removes commons-io and commons-codec dependency by @frankfliu in #1410
- [TRTLLM][CI] add baichuan and internlm model to the ci for trtllm by @lanking520 in #1416
- [CI] add mistral model tests by @lanking520 in #1417
- [ci] Fixes build issues on aarch64 machine by @frankfliu in #1421
- [ci] Suppress javadoc warnings. by @frankfliu in #1422
- [TRTLLM] [CI] Add ChatGLM to trtllm ci by @ydm-amazon in #1430
- [TRTLLM] add a few more models and tests by @lanking520 in #1432
- [TRTLLM] add qwen model to CI by @lanking520 in #1440
- Remove extra curly brace by @ydm-amazon in #1444
- [TRTLLM] trust remote code for qwen by @lanking520 in #1446
- Install flash attn v1 and v2 in order by @xyang16 in #1451
- [CI] add ci for mixtral and phi2 model by @lanking520 in #1462
- [CI] fix name builder by @lanking520 in #1463
- [CI] add trust remote code to phi by @lanking520 in #1465
- [docker] Add lmi_vllm wheel and environment variables by @xyang16 in #1482
- [docker] Install lmi_vllm wheel by @xyang16 in #1483
- add jinja2 for chat completion use case by @lanking520 in #1484
- [serving] Updates dependencies to latest version by @frankfliu in #1466
- [serving] Adds logs to ModelServerTest. by @frankfliu in #1480
- also build megablocks by default by @lanking520 in #1383
- [TRTLLM] bump up trtllm to 0.7.0 by @lanking520 in #1427
- [CI] add 2 inf2 rolling batch tests by @lanking520 in #1472
- [TNX][CI] setup java for tnx models by @lanking520 in #1476
- [DLC][DeepSpeed] upgrade DeepSpeed container to 12.1 by @lanking520 in #1381
- [docker] Upgrades JDK 17 for docker images by @frankfliu in #1418
- [TRTLLM][CI] reduce memory for GPTJ by @lanking520 in #1439
- Record benchmarks for manual by @zachgk in #1405
- [DLC][TNX] optimum version bump by @tosterberg in #1414
- [CI][IB] Parses template from S3 and local dir by @zachgk in #1406
- [LMIDist] Add test that calls LmiDistRollingBatch with python script by @KexinFeng in #1443
- Remove old flash models in LMI Dist by @xyang16 in #1447
- downgrade pynvml version to be compatible with SM Hosts by @rohithkrn in #1467
- [TRTLLM] build with tp1 by @lanking520 in #1433
- parameterize trtllm version by @rohithkrn in #1441
- remove megablocks from the list by @lanking520 in #1456
- [TNX] llama 70b special param in cc flags by @lanking520 in #1469
- add pydantic validator accepted type by @ydm-amazon in #1479
Full Changelog: v0.25.0...v0.26.0