draft: Added gRPC timer for graceful shutdown of inflight requests #7835
Closed
Conversation
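The PR title describes a timer that lets in-flight gRPC requests drain before the server shuts down. Triton's gRPC frontend is C++, so the sketch below is only an illustration of the general drain-timer pattern in stdlib Python; the class and method names are hypothetical, not Triton's actual API:

```python
import threading
import time

class GracefulShutdownTimer:
    """Track in-flight requests; on shutdown, wait up to a grace
    period for them to drain before the caller force-cancels."""

    def __init__(self):
        self._inflight = 0
        self._cond = threading.Condition()

    def request_started(self):
        with self._cond:
            self._inflight += 1

    def request_finished(self):
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def shutdown(self, grace_seconds):
        """Return True if all in-flight requests drained in time,
        False if the grace period expired."""
        deadline = time.monotonic() + grace_seconds
        with self._cond:
            while self._inflight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True
```

A real gRPC server would pair this with cancelling the remaining RPCs once `shutdown` returns `False`.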
Added support for OTel context propagation --------- Co-authored-by: Markus Hennerbichler <[email protected]> Co-authored-by: Ryan McCormick <[email protected]>
This validates the change made to ../core with respect to how model configuration mtime is handled.
* Run all cases with shm probe
* Warmup test and then run multiple iterations
* Log free shared memory on enter/exit of probe
* Add shm probe to all tests
* Add debug_str to shm_util
* Refactor ensemble_io test, modify probe to check for growth rather than inequality
* Improve stability of bls_tensor_lifecycle gpu memory tests
* Add more visibility into failing model/case in python_unittest helper
* [FIXME] Skip probe on certain subtests for now
* [FIXME] Remove shm probe from test_restart on unhealthy stub
* Start clean server run for each bls test case
* Don't exit early on failure so logs can be properly collected
* Restore bls test logic
* Fix shm size compare
* Print region name that leaked
* Remove special handling on unittest
* Remove debug str
* Add enter and exit delay to shm leak probe
---------
Co-authored-by: Ryan McCormick <[email protected]>
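The commits above describe a shared-memory leak probe that snapshots regions around a test and flags growth rather than any inequality. A minimal sketch of that idea in stdlib Python (function names and the `/dev/shm` default are illustrative, not the actual shm_util code):

```python
import os

def snapshot_regions(shm_dir="/dev/shm"):
    """Map region name -> size in bytes for files in a shm directory."""
    if not os.path.isdir(shm_dir):
        return {}
    return {
        name: os.path.getsize(os.path.join(shm_dir, name))
        for name in os.listdir(shm_dir)
    }

def grown_regions(before, after):
    """Regions that appeared or grew between two snapshots --
    checking for growth rather than inequality tolerates regions
    that shrink or disappear during the test."""
    return {
        name: size
        for name, size in after.items()
        if size > before.get(name, 0)
    }
```

A probe like this would take a snapshot on test entry, another on exit, and report the names of any grown regions as suspected leaks.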
* Update trace_summery script * Remove GRPC_WAITREAD and Overhead
* Add gsutil cp retry helper function * Add max retry to GCS upload * Use simple sequential upload
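The commit above adds a retry wrapper around `gsutil cp` uploads. A generic sketch of a command retry helper with backoff in stdlib Python (the helper name and backoff policy are hypothetical, not the test suite's actual shell function):

```python
import subprocess
import time

def run_with_retry(cmd, max_retries=3, base_delay=1.0):
    """Run a command, retrying with linearly increasing backoff
    on non-zero exit. Returns the CompletedProcess on success."""
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            return result
        if attempt < max_retries:
            time.sleep(base_delay * attempt)
    raise RuntimeError(f"command failed after {max_retries} attempts: {cmd}")
```

For a GCS upload this would wrap something like `["gsutil", "cp", src, dst]`, falling back to a simple sequential upload as the log mentions.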
* Handle empty output * Add test case for 0 dimension output * Fix up number of tests
* tensorrt-llm benchmarking test
…und (#6834) * Update miniconda version * Install pytest for different py version * Install pytest
* Add test for shutdown while loading * Fix intermittent failure on test_model_config_overwrite
Adding OpenTelemetry Batch Span Processor --------- Co-authored-by: Theo Clark <[email protected]> Co-authored-by: Ryan McCormick <[email protected]>
* Support Double-Type Infer/Response Parameters
* Base Python Backend Support for Windows
* Add unit test reports to L0_dlpack_multi_gpu * Add unit test reports to L0_warmup
* Add response statistics
* Add L0_response_statistics
* Enable http vs grpc statistics comparison
* Add docs for response statistics protocol
* Add more comments for response statistics test
* Remove model name from config
* Improve docs wordings
* [Continue] Improve docs wordings
* [Continue] Add more comments for response statistics test
* [Continue 2] Improve docs wordings
* Fix typo
* Remove mentioning decoupled from docs
* [Continue 3] Improve docs wordings
* [Continue 4] Improve docs wordings
---------
Co-authored-by: Ryan McCormick <[email protected]>
* Switch to Python model for busyop test * Clean up * Address comment * Remove unused import
…ublished containers (#7759)
…am metric buckets (#7752)
Co-authored-by: Kyle McGill <[email protected]>
…t version map to dictionary (#7500) Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Kyle McGill <[email protected]>
Co-authored-by: GuanLuo <[email protected]>
…riton-inference-server/server into mwittwer/gRPC-endpoint-timeout
pvijayakrish force-pushed the mwittwer/gRPC-endpoint-timeout branch from a238be0 to 32601bf on January 15, 2025 at 17:13
What does the PR do?
Checklist
<commit_type>: <Title>
Commit Type: Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)