
feat(data-plane): improve NCCL data plane configuration #69

Open · wants to merge 7 commits into main
Conversation

@ishandhanani (Contributor) commented Jan 28, 2025

What does the PR do?

To make this example work in a K8s setting, the prefill node needs to be able to communicate with the decode node. However, the prefill worker cannot resolve another pod's hostname. To solve this, we expose a decode Service and save its hostname for the prefill worker to access via VLLM_DATA_PLANE_HOSTNAME.
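
As a rough illustration (not the exact code in this PR; the constructor call mirrors the diff discussed later in the review), the decode worker binds to its own pod hostname but advertises the decode Service name taken from VLLM_DATA_PLANE_HOSTNAME, which the prefill worker can resolve:

```python
# Sketch only: bind to the pod-local hostname, advertise the K8s Service name.
import os
import socket

bind_hostname = socket.gethostname()  # pod hostname, not resolvable from other pods
advertise_hostname = os.environ.get("VLLM_DATA_PLANE_HOSTNAME", bind_hostname)

# Roughly how the data plane would then be constructed (see the diff in the review below):
# self._data_plane = VllmNcclDataPlane(
#     bind_hostname=bind_hostname,
#     advertise_hostname=advertise_hostname,
# )
```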

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

This was run using the current README examples for a single container and was successful. Relevant snippets are below.

Prefill/decode worker output

INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 1 sequences and 3500 tokens per request: 347.78x
INFO 01-28 18:57:16 llm_engine.py:493] Profiling with 347 sequences
INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 347 sequences and 3500 tokens per request: 341.39x
INFO 01-28 18:57:16 llm_engine.py:493] Profiling with 341 sequences
INFO 01-28 18:57:16 llm_engine.py:497] Maximum concurrency for 341 sequences and 3500 tokens per request: 341.49x
INFO 01-28 18:57:16 gpu_executor.py:122] # GPU blocks: 74701, # CPU blocks: 4096
INFO 01-28 18:57:16 gpu_executor.py:126] Maximum concurrency for 3500 tokens per request: 341.49x
INFO 01-28 18:57:19 data_plane.py:152] Rank 0 binding to brev-h100-9-gpu01:13337
INFO 01-28 18:57:19 data_plane.py:153] Advertising to brev-h100-9-gpu01:13337
INFO 01-28 18:57:19 data_plane.py:161] Rank 0 connected to the server
INFO 01-28 18:57:19 kv_cache.py:75] Store set up
INFO 01-28 18:57:19 kv_cache.py:97] KVCacheHandler initialized
INFO 01-28 18:57:21 model_runner.py:1406] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 18:57:21 model_runner.py:1410] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 18:57:39 model_runner.py:1534] Graph capturing finished in 19 secs.
INFO 01-28 18:57:39 data_plane.py:152] Rank 1 binding to brev-h100-9-gpu01:13338
INFO 01-28 18:57:39 data_plane.py:153] Advertising to brev-h100-9-gpu01:13338
INFO 01-28 18:57:39 data_plane.py:161] Rank 1 connected to the server
INFO 01-28 18:57:39 kv_cache.py:75] Store set up
INFO 01-28 18:57:39 kv_cache.py:97] KVCacheHandler initialized
INFO 01-28 18:57:40 llm_engine.py:520] Setting max_num_seqs to 341
INFO 01-28 18:57:40 llm_engine.py:520] Setting max_num_seqs to 341
18:57:41 worker.py:266[Triton Worker] INFO: Worker started...
18:57:41 worker.py:241[Triton Worker] INFO: Starting generate handler...
18:57:41 worker.py:266[Triton Worker] INFO: Worker started...
18:57:41 worker.py:241[Triton Worker] INFO: Starting llama handler...

Inference result

INFO 01-28 19:05:04 async_llm_engine.py:207] Added request b76d58eb-77dd-443d-bbf8-da01326b051f.
INFO 01-28 19:05:04 async_llm_engine.py:175] Finished request b76d58eb-77dd-443d-bbf8-da01326b051f.
<SNIP>
INFO 01-28 19:05:04 async_llm_engine.py:207] Added request b76d58eb-77dd-443d-bbf8-da01326b051f___0.
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<290 bytes>', 'more_body': True}
data: {"id":"b76d58eb-77dd-443d-bbf8-da01326b051f","choices":[{"delta":{"content":"\n\n","role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1738091104,"model":"llama","system_fingerprint":"b76d58eb-77dd-443d-bbf8-da01326b051f","object":"chat.completion.chunk"}

INFO 01-28 19:05:04 async_llm_engine.py:175] Finished request b76d58eb-77dd-443d-bbf8-da01326b051f___0.
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<317 bytes>', 'more_body': True}
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<14 bytes>', 'more_body': True}
TRACE:    127.0.0.1:35186 - ASGI [4] Send {'type': 'http.response.body', 'body': '<0 bytes>', 'more_body': False}
TRACE:    127.0.0.1:35186 - ASGI [4] Receive {'type': 'http.disconnect'}
data: {"id":"b76d58eb-77dd-443d-bbf8-da01326b051f","choices":[{"delta":{"content":"The capital of France is Paris.","role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1738091104,"model":"llama","system_fingerprint":"b76d58eb-77dd-443d-bbf8-da01326b051f","object":"chat.completion.chunk"}
  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@ishandhanani ishandhanani changed the title [WIP] feat(llm): ensure disaggregated vllm works on k8s [WIP] feat(data-plane): Enhance NCCL data plane configuration and improve logging Jan 28, 2025
@ishandhanani ishandhanani changed the title [WIP] feat(data-plane): Enhance NCCL data plane configuration and improve logging feat(data-plane): Enhance NCCL data plane configuration and improve logging Jan 28, 2025
@ishandhanani ishandhanani changed the title feat(data-plane): Enhance NCCL data plane configuration and improve logging feat(data-plane): improve NCCL data plane configuration Jan 28, 2025
@@ -39,6 +39,7 @@
 from triton_distributed.icp.ucp_data_plane import DataPlaneError, UcpDataPlane

 logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)

Contributor:
should this be set via arg somewhere?
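
For illustration only, the level could be made configurable instead of hard-coded; the environment variable name below is hypothetical and not defined by this PR:

```python
import logging
import os

logger = logging.getLogger(__name__)
# Hypothetical override: default to INFO, opt into DEBUG via the environment
logger.setLevel(os.environ.get("VLLM_DATA_PLANE_LOG_LEVEL", "INFO"))
```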

Contributor:
Please move this to deployment folder scripts if it is not already supported.

-        self._port = port
+        self._bind_hostname = bind_hostname or socket.gethostname()
+        self._advertise_hostname = advertise_hostname or self._bind_hostname
+        self._port = port or (13337 + torch.distributed.get_rank())

Contributor:
should we make 13337 a constant?

Contributor:
This will need an environment variable. Please consider VLLM_DATA_PLANE_MIN_PORT.

@piotrm-nvidia (Contributor) commented Jan 28, 2025:

The important scenario is 4 workers running on a single node, using 4 ports on a single machine. Is it going to work with your change? I'm afraid you need to define an env variable with a comma-separated list of allowed ports, and the code should select the allowed port based on rank; this must be adjusted in external scripts based on the number of workers running on the single machine.
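
A sketch of what the suggested selection could look like; VLLM_DATA_PLANE_PORTS is a hypothetical name for the comma-separated list, and VLLM_DATA_PLANE_MIN_PORT is the variable proposed in the comment above:

```python
import os

import torch.distributed as dist

rank = dist.get_rank()  # assumes the process group is already initialized

ports_env = os.environ.get("VLLM_DATA_PLANE_PORTS")  # e.g. "13337,13338,13339,13340"
if ports_env:
    allowed_ports = [int(p) for p in ports_env.split(",")]
    port = allowed_ports[rank % len(allowed_ports)]
else:
    # Fall back to a base port plus rank, as the current diff does with 13337
    port = int(os.environ.get("VLLM_DATA_PLANE_MIN_PORT", "13337")) + rank
```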

Contributor:
I'm testing a config with two ports in use:

INFO 01-28 13:00:17 data_plane.py:153] Rank 1 binding to node:13338                                                                                                                                       
INFO 01-28 13:00:17 data_plane.py:161] Rank 1 connected to the server                                                                                                                                                                
INFO 01-28 13:00:17 kv_cache.py:64] Store set up                                                                  
INFO 01-28 13:00:17 kv_cache.py:86] KVCacheHandler initialized                                                                                                                                                                       
INFO 01-28 13:00:21 llm_engine.py:497] Maximum concurrency for 5 sequences and 131072 tokens per request: 5.44x                                                                                                                      
INFO 01-28 13:00:21 gpu_executor.py:122] # GPU blocks: 44577, # CPU blocks: 4096                                                                                                                                                     
INFO 01-28 13:00:21 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 5.44x                                                                                                                                    
INFO 01-28 13:00:22 data_plane.py:153] Rank 0 binding to node:13337                                                                                                                                       
INFO 01-28 13:00:22 data_plane.py:161] Rank 0 connected to the server  

@nnshah1 (Contributor) left a comment:

LGTM - @piotrm-nvidia or @ptarasiewiczNV - can you verify?

if envs.VLLM_WORKER_ID >= envs.VLLM_CONTEXT_WORKERS:
    # bind to actual host but advertise a service
    self._data_plane = VllmNcclDataPlane(
        bind_hostname=socket.gethostname(),

Contributor:
It seems that the actual bind host and the hostname sent to other workers must be different for other hosts to connect successfully.

Can you draw a topology of how this is supposed to look?
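
One plausible topology, inferred from the PR description rather than confirmed in this thread (the Service name "decode-svc" is made up for the sketch):

```
prefill pod                                      decode pod
  prefill worker                                   decode worker
      |                                              binds:      socket.gethostname():1333X
      | connects to "decode-svc":1333X               advertises:  VLLM_DATA_PLANE_HOSTNAME ("decode-svc")
      v
  Kubernetes Service "decode-svc"  ------------->  decode pod : 1333X
```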

@ptarasiewiczNV (Contributor) left a comment:
LGTM, just need to address @piotrm-nvidia's concerns.
