feat(data-plane): improve NCCL data plane configuration #69
base: main
Conversation
```diff
@@ -39,6 +39,7 @@
 from triton_distributed.icp.ucp_data_plane import DataPlaneError, UcpDataPlane

 logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
```
should this be set via arg somewhere?
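One way to address this would be to take the level from a command-line flag rather than hard-coding `DEBUG`. A minimal sketch, assuming a hypothetical `--log-level` flag (the repo may already have a CLI parser this could hook into):

```python
import argparse
import logging

# Hypothetical flag name; adapt to the project's existing argument parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-level",
    default="INFO",
    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
)
args = parser.parse_args(["--log-level", "DEBUG"])  # simulate CLI input

logger = logging.getLogger(__name__)
# Map the string choice to the logging module's level constant.
logger.setLevel(getattr(logging, args.log_level))
```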
Please move this to deployment folder scripts if it is not already supported.
```diff
-        self._port = port
+        self._bind_hostname = bind_hostname or socket.gethostname()
+        self._advertise_hostname = advertise_hostname or self._bind_hostname
+        self._port = port or (13337 + torch.distributed.get_rank())
```
should we make 13337 a constant?
This will need an environment variable. Please consider VLLM_DATA_PLANE_MIN_PORT.
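A sketch of what that could look like, assuming the suggested `VLLM_DATA_PLANE_MIN_PORT` variable (not yet defined in `envs`) and a named constant replacing the magic number 13337:

```python
import os

# Named constant instead of the magic number in the diff.
DEFAULT_DATA_PLANE_MIN_PORT = 13337

def data_plane_port(rank: int) -> int:
    # VLLM_DATA_PLANE_MIN_PORT is the reviewer-suggested name; hypothetical here.
    min_port = int(
        os.environ.get("VLLM_DATA_PLANE_MIN_PORT", DEFAULT_DATA_PLANE_MIN_PORT)
    )
    return min_port + rank
```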
The important scenario is four workers running on a single node, using four ports on one machine. Is that going to work with your change? I'm afraid you may need to define an environment variable with a comma-separated list of allowed ports, and the code should select a port from that list based on rank. External scripts would then have to adjust the list to match the number of workers running on the machine.
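The allow-list idea could be sketched roughly like this; the variable name `VLLM_DATA_PLANE_PORTS` and the default list are assumptions for illustration, not part of the PR:

```python
import os

def select_allowed_port(rank: int) -> int:
    # Hypothetical variable: a comma-separated allow-list maintained by
    # external deployment scripts, e.g. "13337,13338,13339,13340".
    raw = os.environ.get("VLLM_DATA_PLANE_PORTS", "13337,13338,13339,13340")
    ports = [int(p) for p in raw.split(",")]
    if rank >= len(ports):
        raise ValueError(
            f"rank {rank} has no allowed port (only {len(ports)} configured)"
        )
    return ports[rank]
```

With four workers on one node, ranks 0 through 3 each get a distinct port from the list, and the deployment scripts stay in control of which ports are permitted.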
I'm testing a config with two ports in use:
```
INFO 01-28 13:00:17 data_plane.py:153] Rank 1 binding to node:13338
INFO 01-28 13:00:17 data_plane.py:161] Rank 1 connected to the server
INFO 01-28 13:00:17 kv_cache.py:64] Store set up
INFO 01-28 13:00:17 kv_cache.py:86] KVCacheHandler initialized
INFO 01-28 13:00:21 llm_engine.py:497] Maximum concurrency for 5 sequences and 131072 tokens per request: 5.44x
INFO 01-28 13:00:21 gpu_executor.py:122] # GPU blocks: 44577, # CPU blocks: 4096
INFO 01-28 13:00:21 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 5.44x
INFO 01-28 13:00:22 data_plane.py:153] Rank 0 binding to node:13337
INFO 01-28 13:00:22 data_plane.py:161] Rank 0 connected to the server
```
LGTM - @piotrm-nvidia or @ptarasiewiczNV - can you verify?
```python
if envs.VLLM_WORKER_ID >= envs.VLLM_CONTEXT_WORKERS:
    # bind to actual host but advertise a service
    self._data_plane = VllmNcclDataPlane(
        bind_hostname=socket.gethostname(),
```
It seems that the actual bind host and the hostname sent to other workers must be different for other hosts to connect successfully.
Can you draw the topology to show how it is supposed to look?
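To make the bind-versus-advertise split concrete, here is an illustrative sketch (not the actual `VllmNcclDataPlane` API): the server binds to its own pod hostname, but advertises a stable name, such as a K8s Service, that other workers can resolve.

```python
import socket
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataPlaneEndpoint:
    """Illustrative only; names do not mirror VllmNcclDataPlane's fields."""
    bind_hostname: str       # interface the server actually listens on
    advertise_hostname: str  # name other workers use to reach it
    port: int

def make_endpoint(advertise: Optional[str], port: int) -> DataPlaneEndpoint:
    # Bind locally (e.g. the pod's own hostname); in K8s, advertise a
    # Service name that other pods can resolve via cluster DNS.
    bind = socket.gethostname()
    return DataPlaneEndpoint(bind, advertise or bind, port)
```

In the K8s topology described in this PR, the decode worker would advertise the decode Service name while still binding locally, so the prefill pod never needs to resolve the decode pod's hostname directly.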
LGTM, just need to address @piotrm-nvidia's concerns.
What does the PR do?
To make this example work in a K8s setting, the prefill node needs to be able to communicate with the decode node. However, the prefill worker cannot resolve another pod's hostname. To solve this, we expose a decode service and save it for the prefill worker to access via
VLLM_DATA_PLANE_HOSTNAME
.
Checklist
<commit_type>: <Title>
Commit Type: Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
This was run using the current README examples for a single container and was successful. Relevant snippets are below.
Prefill/decode worker output
Inference result
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)