Add draft launching script #70

Draft: wants to merge 8 commits into main

Conversation

piotrm-nvidia
Contributor

No description provided.

@piotrm-nvidia
Contributor Author

Works for TP==1

This works when the context TP size is 1:

python3 launch_workers.py --log-level DEBUG --context-workers 2 --context-tp-size 1 --generate-workers 1 --generate-tp-size 1

The curl request returns a successful result.
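
For reference, a minimal Python equivalent of that curl request, as a sketch only: the host, port, model name, and payload shape are assumptions based on the launcher logs below.

import requests

# Hypothetical smoke test against the OpenAI-compatible API server started by
# the launcher; adjust the host and port to match your deployment.
url = "http://node:8005/v1/chat/completions"
payload = {
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())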

Fails for TP>1

When the context TP size is set above 1, the logs contain errors about connecting to port 36183:

[W128 13:38:34.115654817 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
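
One way to narrow down the errno 22 is to probe the rendezvous endpoint directly from the failing node. A minimal sketch, assuming the host and port from the warning above:

import socket

# Connectivity probe for the c10d rendezvous endpoint from the warning above;
# both HOST and PORT are taken from the log and may differ in other runs.
HOST = "node.cluster.com"
PORT = 36183

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        print(f"Connected to {HOST}:{PORT} from {sock.getsockname()}")
except OSError as exc:
    # errno 22 (Invalid argument) can point at a bad address or port value,
    # errno 111 (Connection refused) at no listener on that port.
    print(f"Failed to connect to {HOST}:{PORT}: {exc}")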

Reproduction

python3 launch_workers.py --log-level DEBUG --context-workers 2 --context-tp-size 2 --generate-workers 1 --generate-tp-size 2

log:

launch_workers.py: INFO: main(): 648:   Namespace(model_ckpt='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_dir='/logs', dry_run=False, context_tp_size=2, generate_tp_size=2, baseline_tp_size=1, context_max
_batch_size=10000, generate_max_batch_size=10000, baseline_max_batch_size=10000, log_level='DEBUG', nats_url='node:4223', nats_store='/tmp/nats/triton-3-demo', nats_debug=False, context_workers=2, generate_workers=1, baseline_
workers=None, api_server_url='http://node:8005', workers_only=False, max_model_len=None, max_num_seqs=-1, context_max_num_seqs=-1, generate_max_num_seqs=-1, benchmark=False, benchmark_timeout=300, isl_cached=0, isl_uncached=20
48, osl=128, load_type='concurrency', load_value=[32], request_count_per_load_value=100, min_request_count=None, data_plane_backend='nccl', enable_chunked_prefill=False, enable_prefix_caching=False, artifact_dir='artifacts', base
line_gpu_memory_utilization=0.9, context_gpu_memory_utilization=0.9, generate_gpu_memory_utilization=0.5, profile_workers=False, visible_devices=[0, 1, 2, 3, 4, 5, 6, 7], host='node')
launch_workers.py: INFO: main(): 649:   Example root: /
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_ID=0 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9 --initialize-request-plane
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=2,3 VLLM_WORKER_ID=1 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=4,5 VLLM_WORKER_ID=2 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 0 --generate-worker-count 1 --max-batch-size 10000 --gpu-memory-utilization 0.5
launch_workers.py: INFO: _launch_api_server(): 126:     Launching API server: python3 -m llm.api_server --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --request-plane-uri node:4223 --api-server-host node --api-serve
r-port 8005 --model-name llama
launch_workers.py: INFO: main(): 683:   waiting for <Popen: returncode: None args: ['python3', '-m', 'llm.vllm.deploy', '--model...>
WARNING 01-28 13:38:20 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:20 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
13:38:21 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b5f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='llama',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmContextOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 1,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 0,
                                                   'gpu_memory_utilization': 0.9,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': True,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='llama.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='llama.0' parent=2971261 initial>

Workers started ... press Ctrl-C to Exit
[2972184] 2025/01/28 13:38:21.925873 [INF] Starting nats-server
[2972184] 2025/01/28 13:38:21.926005 [INF]   Version:  2.10.24
[2972184] 2025/01/28 13:38:21.926007 [INF]   Git:      [1d6f7ea]
[2972184] 2025/01/28 13:38:21.926009 [INF]   Name:     NDCCC54S4EE7UKEYVQGVDOTZCRAOSHOHW43SML5IIEXDSXLHULSP66DX
[2972184] 2025/01/28 13:38:21.926013 [INF]   Node:     09ObDJPK
[2972184] 2025/01/28 13:38:21.926014 [INF]   ID:       NDCCC54S4EE7UKEYVQGVDOTZCRAOSHOHW43SML5IIEXDSXLHULSP66DX
[2972184] 2025/01/28 13:38:21.926443 [INF] Starting JetStream
[2972184] 2025/01/28 13:38:21.926579 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[2972184] 2025/01/28 13:38:21.926583 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[2972184] 2025/01/28 13:38:21.926604 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[2972184] 2025/01/28 13:38:21.926606 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[2972184] 2025/01/28 13:38:21.926609 [INF]
[2972184] 2025/01/28 13:38:21.926611 [INF]          https://docs.nats.io/jetstream
[2972184] 2025/01/28 13:38:21.926614 [INF]
[2972184] 2025/01/28 13:38:21.926616 [INF] ---------------- JETSTREAM ----------------
[2972184] 2025/01/28 13:38:21.926619 [INF]   Max Memory:      1.48 TB
[2972184] 2025/01/28 13:38:21.926622 [INF]   Max Storage:     755.84 GB
[2972184] 2025/01/28 13:38:21.926624 [INF]   Store Directory: "/tmp/nats_store/jetstream"
[2972184] 2025/01/28 13:38:21.926626 [INF] -------------------------------------------
[2972184] 2025/01/28 13:38:21.928631 [INF] Listening for client connections on 0.0.0.0:4223
[2972184] 2025/01/28 13:38:21.930827 [INF] Server is ready
13:38:22 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b5f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='llama',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmContextOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 1,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 0,
                                                   'gpu_memory_utilization': 0.9,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': False,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='llama.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='llama.0' parent=2971392 initial>

Workers started ... press Ctrl-C to Exit
WARNING 01-28 13:38:22 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
Namespace(api_server_host='node', api_server_port=8005, request_plane_uri='node:4223', data_plane_host=None, data_plane_port=0, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_level=1, progra
m_name='OpenAI API Sever')
13:38:23 __main__.py:28[OpenAI API Sever] INFO: Starting
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO:     Started server process [2971700]
INFO:     Waiting for application startup.
TRACE:    ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
TRACE:    ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE:    ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO:     Application startup complete.
INFO:     Uvicorn running on http://node:8005 (Press CTRL+C to quit)
13:38:24 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b1f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='generate',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmGenerateOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 0,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 1,
                                                   'gpu_memory_utilization': 0.5,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': False,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='generate.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='generate.0' parent=2971568 initial>

Workers started ... press Ctrl-C to Exit
WARNING 01-28 13:38:26 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:26 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:30 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
INFO 01-28 13:38:33 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-28 13:38:33 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 01-28 13:38:33 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
WARNING 01-28 13:38:33 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
INFO 01-28 13:38:33 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_p
ath=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fa
lse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached
_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:33 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_p
ath=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fa
lse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached
_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:33 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:33 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:34 parallel_state.py:939] ==================================================
INFO 01-28 13:38:34 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:34 parallel_state.py:942] Stage: PREFILL
[W128 13:38:34.115654817 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.115855462 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116029983 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116208041 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116361675 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116533762 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116697932 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116871732 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117042191 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117200333 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117361510 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117524740 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
INFO 01-28 13:38:34 parallel_state.py:939] ==================================================
INFO 01-28 13:38:34 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:34 parallel_state.py:942] Stage: PREFILL
INFO 01-28 13:38:35 parallel_state.py:958] world_size: 3, rank: 1, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:35 parallel_state.py:960] world_size=3 rank=1 local_rank=0 distributed_init_method=None backend=nccl
INFO 01-28 13:38:36 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 01-28 13:38:36 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
INFO 01-28 13:38:36 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_
path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=F
alse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cache
d_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:37 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:37 parallel_state.py:939] ==================================================
INFO 01-28 13:38:37 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:37 parallel_state.py:942] Stage: GENERATE
INFO 01-28 13:38:37 parallel_state.py:958] world_size: 6, rank: 4, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:37 parallel_state.py:960] world_size=6 rank=4 local_rank=0 distributed_init_method=None backend=nccl
INFO 01-28 13:38:37 parallel_state.py:958] world_size: 3, rank: 0, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:37 parallel_state.py:960] world_size=3 rank=0 local_rank=0 distributed_init_method=None backend=nccl

@piotrm-nvidia
Contributor Author

Execution outside the launching script succeeds with the config below:

export VLLM_DATA_PLANE_BACKEND=nccl
export VLLM_TORCH_HOST=localhost
export VLLM_GENERATE_WORKERS=1
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_BASELINE_WORKERS=0
export VLLM_GENERATE_TP_SIZE=2
export VLLM_LOGGING_LEVEL=INFO
export VLLM_CONTEXT_WORKERS=1
export VLLM_BASELINE_TP_SIZE=1
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_WORKER_ID=0 \
python3 -m llm.vllm.deploy \
  --context-worker-count 1 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --worker-name llama \
  --disable-async-output-proc \
  --disable-log-stats \
  --max-model-len 3500 \
  --max-batch-size 10000 \
  --gpu-memory-utilization 0.9 \
  --context-tp-size 2 \
  --generate-tp-size 2 \
  --initialize-request-plane &
CUDA_VISIBLE_DEVICES=2,3 \
VLLM_WORKER_ID=1 \
python3 -m llm.vllm.deploy \
  --generate-worker-count 1 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --worker-name llama \
  --disable-async-output-proc \
  --disable-log-stats \
  --max-model-len 3500 \
  --max-batch-size 10000 \
  --gpu-memory-utilization 0.9 \
  --context-tp-size 2 \
  --generate-tp-size 2 &
python3 -m llm.api_server \
  --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --api-server-host ${HOSTNAME} \
  --model-name llama \
  --api-server-port 8005 &
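
The main difference seems to be the environment: the manual run exports VLLM_TORCH_HOST, VLLM_TORCH_PORT, and the TP sizes before starting each worker. A minimal sketch of how the launching script could forward the same variables to each spawned worker (hypothetical code, not the actual launch_workers.py implementation; the values mirror the exports above):

import os
import subprocess

def launch_worker(worker_id: int, cuda_devices: str, extra_args: list[str]) -> subprocess.Popen:
    # Copy the parent environment and add the per-worker variables that the
    # working manual configuration above sets explicitly.
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": cuda_devices,
        "VLLM_WORKER_ID": str(worker_id),
        "VLLM_TORCH_HOST": "localhost",
        "VLLM_TORCH_PORT": "36183",
        "VLLM_CONTEXT_TP_SIZE": "2",
        "VLLM_GENERATE_TP_SIZE": "2",
    })
    cmd = ["python3", "-m", "llm.vllm.deploy", *extra_args]
    return subprocess.Popen(cmd, env=env)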

@piotrm-nvidia
Contributor Author

You can test the launching script on a single 8xH100 machine:

python3 launch_workers.py \
    --log-level INFO \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1  \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <YOUR DIR>

It should print genai-perf output tables with performance results and save them in <YOUR DIR>.

piotrm-nvidia requested a review from glos-nv on January 29, 2025 13:10
@piotrm-nvidia
Contributor Author

piotrm-nvidia commented Jan 29, 2025

I executed the sbatch launch script on two nodes of a Slurm cluster:

#!/bin/bash
#SBATCH --partition=<PARTITION>
#SBATCH --account=<ACCOUNT>
#SBATCH --job-name=<JOB>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=2:00:00
#SBATCH --mem=0
#SBATCH --no-kill
#SBATCH --exclusive
#SBATCH --output=<ARTIFACTS>/%x_%j_%n_%N.out              ### Slurm output file, %x is job name, %j is job id
#SBATCH --error=<ARTIFACTS>/%x_%j_%n_%N.err               ### Slurm error file, %x is job name, %j is job id

set -e
set -x

export HF_TOKEN=<TOKEN>

export JOB_DIR=<ARTIFACTS>
export LOGDIR=${JOB_DIR}/logs
export PROFILESDIR=${JOB_DIR}/profiles
export SCHEDULER_FILE=$LOGDIR/scheduler.json
export SCHEDULER_LOG=$LOGDIR/scheduler.log
export DONE_MARKER=$LOGDIR/done.txt

export DEVICE="gpu"

export INTERFACE="eth3"
export PROTOCOL="tcp"

export CPU_WORKER_MEMORY_LIMIT="14GB"
export RAPIDS_NO_INITIALIZE="1"
export CUDF_SPILL="1"
export RMM_SCHEDULER_POOL_SIZE="1GB"
export RMM_WORKER_POOL_SIZE="72GiB"
export LIBCUDF_CUFILE_POLICY=OFF
export DASK_DATAFRAME__QUERY_PLANNING=False

mkdir -p $LOGDIR
mkdir -p $PROFILESDIR

srun --container-mounts=/lustre/fsw/:/lustre/fsw/ --container-image=<CONTAINER> bash -c "python3 launch_workers.py \
    --log-level INFO \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 32 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <ARTIFACTS>"

The log output suggests that the NATS.io host is not propagated correctly to any of the NATS.io-related components.

python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 0 --generate-worker-count 1 --max-batch-size 10000 --gpu-memory-utilization 0.5
launch_workers.py: INFO: main(): 701:   Namespace(model_ckpt='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_dir='/lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/examples/llm/logs', dry_run=False, context_tp_size=2, generate_tp_size=8, baseline_tp_size=1, context_max_batch_size=10000, generate_max_batch_size=10000, baseline_max_batch_size=10000, log_level='INFO', nats_url='nats://node:4223', nats_store='/tmp/nats/triton-3-demo', nats_debug=False, context_workers=4, generate_workers=1, baseline_workers=None, api_server_url='http://node:8005', workers_only=False, max_model_len=None, max_num_seqs=-1, context_max_num_seqs=-1, generate_max_num_seqs=-1, benchmark=True, benchmark_timeout=800, isl_cached=0, isl_uncached=3000, osl=150, load_type='concurrency', load_value=[1, 1, 16, 32], request_count_per_load_value=10, min_request_count=20, data_plane_backend='nccl', enable_chunked_prefill=False, enable_prefix_caching=False, artifact_dir='/lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/art_0129_v2', baseline_gpu_memory_utilization=0.9, context_gpu_memory_utilization=0.9, generate_gpu_memory_utilization=0.5, profile_workers=False, visible_devices=[0, 1, 2, 3, 4, 5, 6, 7], host='node')
launch_workers.py: INFO: main(): 702:   Example root: /lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/examples/llm
launch_workers.py: INFO: _launch_workers(): 320:        Ntasks configuration 2
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_ID=0 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9 --initialize-request-plane
launch_workers.py: INFO: main(): 738:   waiting for <Popen: returncode: None args: ['python3', '-m', 'llm.vllm.deploy', '--model...>
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=2,3 VLLM_WORKER_ID=1 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=4,5 VLLM_WORKER_ID=2 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=6,7 VLLM_WORKER_ID=3 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_api_server(): 133:     Launching API server: python3 -m llm.api_server --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --request-plane-uri nats://node:4223 --api-server-host node --api-server-port 8005 --model-name llama
launch_workers.py: WARNING: wait_for_server(): 569:     Server not responding: HTTPConnectionPool(host='node', port=8005): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffff75afad0>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2469896] 2025/01/29 05:25:02.968507 [INF] Starting nats-server
[2469896] 2025/01/29 05:25:02.968632 [INF]   Version:  2.10.24
[2469896] 2025/01/29 05:25:02.968635 [INF]   Git:      [1d6f7ea]
[2469896] 2025/01/29 05:25:02.968638 [INF]   Name:     NAAHV7QMT2RZBP3GB46O4ANWCSFHMB7M264RQCMB2BO2GR2GFY3FAR32
[2469896] 2025/01/29 05:25:02.968644 [INF]   Node:     zBBWatmk
[2469896] 2025/01/29 05:25:02.968645 [INF]   ID:       NAAHV7QMT2RZBP3GB46O4ANWCSFHMB7M264RQCMB2BO2GR2GFY3FAR32
[2469896] 2025/01/29 05:25:02.968909 [INF] Starting JetStream
[2469896] 2025/01/29 05:25:02.969007 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[2469896] 2025/01/29 05:25:02.969009 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[2469896] 2025/01/29 05:25:02.969010 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[2469896] 2025/01/29 05:25:02.969011 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[2469896] 2025/01/29 05:25:02.969012 [INF]
[2469896] 2025/01/29 05:25:02.969013 [INF]          https://docs.nats.io/jetstream
[2469896] 2025/01/29 05:25:02.969014 [INF]
[2469896] 2025/01/29 05:25:02.969014 [INF] ---------------- JETSTREAM ----------------
[2469896] 2025/01/29 05:25:02.969016 [INF]   Max Memory:      1.48 TB
[2469896] 2025/01/29 05:25:02.969018 [INF]   Max Storage:     755.84 GB
[2469896] 2025/01/29 05:25:02.969019 [INF]   Store Directory: "/tmp/nats_store/jetstream"
[2469896] 2025/01/29 05:25:02.969020 [INF] -------------------------------------------
[2469896] 2025/01/29 05:25:02.970594 [INF] Listening for client connections on 0.0.0.0:4223

...

[rank1]:[W129 05:28:40.599990280 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank12]:[W129 05:28:40.644431120 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W129 05:28:40.644566749 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank8]:[W129 05:28:40.699022478 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank9]:[W129 05:28:40.999093155 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank13]:[W129 05:28:40.152316566 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank10]:[W129 05:28:41.636855564 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank14]:[W129 05:28:41.744224705 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank11]:[W129 05:28:41.275960017 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank15]:[W129 05:28:41.314099102 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1130, in create_connection
    raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('::1', 4223, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 4223)
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1130, in create_connection
    raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('::1', 4223, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 4223)
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server

@piotrm-nvidia
Contributor Author

Benchmark command for context TP=2, DP=2 and generate TP=4, DP=1:

python3 launch_workers.py \
    --log-level INFO \
    --nats-url <HOST>:4223 \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --max-model-len 3500 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 2 4 8 16 32 64 128 256 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <FOLDER>/art_0129_70b_single_node

Benchmark command for context TP=4, DP=1 and generate TP=4, DP=1:

python3 launch_workers.py \
    --log-level INFO \
    --nats-url <HOST>:4223 \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --context-workers 1 \
    --context-tp-size 4 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --max-model-len 3500 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 2 4 8 16 32 64 128 256 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <FOLDER>/art_0129_70b_single_node
