
feat(data-plane): improve NCCL data plane configuration #69

Open · wants to merge 7 commits into main
Conversation

@ishandhanani (Contributor) commented Jan 28, 2025

What does the PR do?

To make this example work in a K8s setting, the prefill node needs to be able to communicate with the decode node. However, the prefill worker cannot resolve another pod's hostname. To solve this, we expose a decode Service and save its hostname for the prefill worker to access via VLLM_DATA_PLANE_HOSTNAME.
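
As a rough illustration (not the exact code in this PR; the constructor call mirrors the diff discussed later in the review), the decode worker binds to its own pod hostname but advertises the decode Service name taken from VLLM_DATA_PLANE_HOSTNAME, which the prefill worker can resolve:

```python
# Sketch only: bind to the pod-local hostname, advertise the K8s Service name.
import os
import socket

bind_hostname = socket.gethostname()  # pod hostname, not resolvable from other pods
advertise_hostname = os.environ.get("VLLM_DATA_PLANE_HOSTNAME", bind_hostname)

# Roughly how the data plane would then be constructed (see the diff in the review below):
# self._data_plane = VllmNcclDataPlane(
#     bind_hostname=bind_hostname,
#     advertise_hostname=advertise_hostname,
# )
```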

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

This was run using the current README examples for a single container and was successful. Relevant snippets are below.

Prefill/decode worker output

INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 1 sequences and 3500 tokens per request: 347.78x
INFO 01-28 18:57:16 llm_engine.py:493] Profiling with 347 sequences
INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 347 sequences and 3500 tokens per request: 341.39x
INFO 01-28 18:57:16 llm_engine.py:493] Profiling with 341 sequences
INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 341 sequences and 3500 tokens per request: 341.49x
INFO 01-28 18:57:16 gpu_executor.py:122] # GPU blocks: 74701, # CPU blocks: 4096
INFO 01-28 18:57:16 gpu_executor.py:126] Maximum concurrency for 3500 tokens per request: 341.49x
INFO 01-28 18:57:19 data_plane.py:152] Rank 0 binding to brev-h100-9-gpu01:13337
INFO 01-28 18:57:19 data_plane.py:153] Advertising to brev-h100-9-gpu01:13337
INFO 01-28 18:57:19 data_plane.py:161] Rank 0 connected to the server
INFO 01-28 18:57:19 kv_cache.py:75] Store set up
INFO 01-28 18:57:19 kv_cache.py:97] KVCacheHandler initialized
INFO 01-28 18:57:21 model_runner.py:1406] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 18:57:21 model_runner.py:1410] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 18:57:39 model_runner.py:1534] Graph capturing finished in 19 secs.
INFO 01-28 18:57:39 data_plane.py:152] Rank 1 binding to brev-h100-9-gpu01:13338
INFO 01-28 18:57:39 data_plane.py:153] Advertising to brev-h100-9-gpu01:13338
INFO 01-28 18:57:39 data_plane.py:161] Rank 1 connected to the server
INFO 01-28 18:57:39 kv_cache.py:75] Store set up
INFO 01-28 18:57:39 kv_cache.py:97] KVCacheHandler initialized
INFO 01-28 18:57:40 llm_engine.py:520] Setting max_num_seqs to 341
INFO 01-28 18:57:40 llm_engine.py:520] Setting max_num_seqs to 341
18:57:41 worker.py:266[Triton Worker] INFO: Worker started...
18:57:41 worker.py:241[Triton Worker] INFO: Starting generate handler...
18:57:41 worker.py:266[Triton Worker] INFO: Worker started...
18:57:41 worker.py:241[Triton Worker] INFO: Starting llama handler...

Inference result

INFO 01-28 19:05:04 async_llm_engine.py:207] Added request b76d58eb-77dd-443d-bbf8-da01326b051f.
INFO 01-28 19:05:04 async_llm_engine.py:175] Finished request b76d58eb-77dd-443d-bbf8-da01326b051f.
<SNIP>
INFO 01-28 19:05:04 async_llm_engine.py:207] Added request b76d58eb-77dd-443d-bbf8-da01326b051f___0.
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<290 bytes>', 'more_body': True}
data: {"id":"b76d58eb-77dd-443d-bbf8-da01326b051f","choices":[{"delta":{"content":"\n\n","role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1738091104,"model":"llama","system_fingerprint":"b76d58eb-77dd-443d-bbf8-da01326b051f","object":"chat.completion.chunk"}

INFO 01-28 19:05:04 async_llm_engine.py:175] Finished request b76d58eb-77dd-443d-bbf8-da01326b051f___0.
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<317 bytes>', 'more_body': True}
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<14 bytes>', 'more_body': True}
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<0 bytes>', 'more_body': False}
TRACE:    127.0.0.1:35186 - ASGI [4] Receive {'type': 'http.disconnect'}
data: {"id":"b76d58eb-77dd-443d-bbf8-da01326b051f","choices":[{"delta":{"content":"The capital of France is Paris.","role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1738091104,"model":"llama","system_fingerprint":"b76d58eb-77dd-443d-bbf8-da01326b051f","object":"chat.completion.chunk"}
  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@ishandhanani ishandhanani changed the title [WIP] feat(llm): ensure disaggregated vllm works on k8s [WIP] feat(data-plane): Enhance NCCL data plane configuration and improve logging Jan 28, 2025
@ishandhanani ishandhanani changed the title [WIP] feat(data-plane): Enhance NCCL data plane configuration and improve logging feat(data-plane): Enhance NCCL data plane configuration and improve logging Jan 28, 2025
@ishandhanani ishandhanani changed the title feat(data-plane): Enhance NCCL data plane configuration and improve logging feat(data-plane): improve NCCL data plane configuration Jan 28, 2025
@@ -39,6 +39,7 @@
 from triton_distributed.icp.ucp_data_plane import DataPlaneError, UcpDataPlane

 logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)

Contributor:
should this be set via arg somewhere?
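
For illustration only, the level could be made configurable instead of hard-coded; the environment variable name below is hypothetical and not defined by this PR:

```python
import logging
import os

logger = logging.getLogger(__name__)
# Hypothetical override: default to INFO, opt into DEBUG via the environment
logger.setLevel(os.environ.get("VLLM_DATA_PLANE_LOG_LEVEL", "INFO"))
```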

Contributor:
Please move this to deployment folder scripts if it is not already supported.

-        self._port = port
+        self._bind_hostname = bind_hostname or socket.gethostname()
+        self._advertise_hostname = advertise_hostname or self._bind_hostname
+        self._port = port or (13337 + torch.distributed.get_rank())

Contributor:
should we make 13337 a constant?

Contributor:
This will need an environment variable. Please consider VLLM_DATA_PLANE_MIN_PORT.

@piotrm-nvidia (Contributor) commented Jan 28, 2025:

The important scenario is 4 workers running on a single node, using 4 ports on a single machine. Is it going to work with your change? I'm afraid you need to define an env variable with a comma-separated list of allowed ports, and the code should select the allowed port based on rank; this must be adjusted in external scripts based on the number of workers running on the single machine.
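
A sketch of what the suggested selection could look like; VLLM_DATA_PLANE_PORTS is a hypothetical name for the comma-separated list, and VLLM_DATA_PLANE_MIN_PORT is the variable proposed in the comment above:

```python
import os

import torch.distributed as dist

rank = dist.get_rank()  # assumes the process group is already initialized

ports_env = os.environ.get("VLLM_DATA_PLANE_PORTS")  # e.g. "13337,13338,13339,13340"
if ports_env:
    allowed_ports = [int(p) for p in ports_env.split(",")]
    port = allowed_ports[rank % len(allowed_ports)]
else:
    # Fall back to a base port plus rank, as the current diff does with 13337
    port = int(os.environ.get("VLLM_DATA_PLANE_MIN_PORT", "13337")) + rank
```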

Contributor:
I'm testing a config with two ports in use:

INFO 01-28 13:00:17 data_plane.py:153] Rank 1 binding to node:13338                                                                                                                                       
INFO 01-28 13:00:17 data_plane.py:161] Rank 1 connected to the server                                                                                                                                                                
INFO 01-28 13:00:17 kv_cache.py:64] Store set up                                                                  
INFO 01-28 13:00:17 kv_cache.py:86] KVCacheHandler initialized                                                                                                                                                                       
INFO 01-28 13:00:21 llm_engine.py:497] Maximum concurrency for 5 sequences and 131072 tokens per request: 5.44x                                                                                                                      
INFO 01-28 13:00:21 gpu_executor.py:122] # GPU blocks: 44577, # CPU blocks: 4096                                                                                                                                                     
INFO 01-28 13:00:21 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 5.44x                                                                                                                                    
INFO 01-28 13:00:22 data_plane.py:153] Rank 0 binding to node:13337                                                                                                                                       
INFO 01-28 13:00:22 data_plane.py:161] Rank 0 connected to the server  

@nnshah1 (Contributor) left a comment:

LGTM - @piotrm-nvidia or @ptarasiewiczNV - can you verify?

if envs.VLLM_WORKER_ID >= envs.VLLM_CONTEXT_WORKERS:
    # bind to actual host but advertise a service
    self._data_plane = VllmNcclDataPlane(
        bind_hostname=socket.gethostname(),

Contributor:
It seems that the actual bind host and the hostname sent to other workers must be different for other hosts to connect successfully.

Can you draw a topology of how this is supposed to look?
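
One plausible topology, inferred from the PR description rather than confirmed in this thread (the Service name "decode-svc" is made up for the sketch):

```
prefill pod                                      decode pod
  prefill worker                                   decode worker
      |                                              binds:      socket.gethostname():1333X
      | connects to "decode-svc":1333X               advertises:  VLLM_DATA_PLANE_HOSTNAME ("decode-svc")
      v
  Kubernetes Service "decode-svc"  ------------->  decode pod : 1333X
```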

@ptarasiewiczNV (Contributor) left a comment:
LGTM, just need to address @piotrm-nvidia's concerns.
