Add draft launching script #70

Draft: wants to merge 8 commits into main

Conversation

piotrm-nvidia
Contributor

No description provided.

@piotrm-nvidia
Contributor Author

Works for TP==1

This works when the context TP size is 1:

python3 launch_workers.py --log-level DEBUG --context-workers 2 --context-tp-size 1 --generate-workers 1 --generate-tp-size 1

The curl request returns a successful result.
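
For reference, a minimal Python equivalent of that curl request, as a sketch only: the host, port, model name, and payload shape are assumptions based on the launcher logs below.

import requests

# Hypothetical smoke test against the OpenAI-compatible API server started by
# the launcher; adjust the host and port to match your deployment.
url = "http://node:8005/v1/chat/completions"
payload = {
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())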

Fails for TP>1

When the context TP size is set above 1, the logs contain errors about connecting to port 36183:

[W128 13:38:34.115654817 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
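
One way to narrow down the errno 22 is to probe the rendezvous endpoint directly from the failing node. A minimal sketch, assuming the host and port from the warning above:

import socket

# Connectivity probe for the c10d rendezvous endpoint from the warning above;
# both HOST and PORT are taken from the log and may differ in other runs.
HOST = "node.cluster.com"
PORT = 36183

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        print(f"Connected to {HOST}:{PORT} from {sock.getsockname()}")
except OSError as exc:
    # errno 22 (Invalid argument) can point at a bad address or port value,
    # errno 111 (Connection refused) at no listener on that port.
    print(f"Failed to connect to {HOST}:{PORT}: {exc}")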

Reproduction

python3 launch_workers.py --log-level DEBUG --context-workers 2 --context-tp-size 2 --generate-workers 1 --generate-tp-size 2

log:

launch_workers.py: INFO: main(): 648:   Namespace(model_ckpt='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_dir='/logs', dry_run=False, context_tp_size=2, generate_tp_size=2, baseline_tp_size=1, context_max
_batch_size=10000, generate_max_batch_size=10000, baseline_max_batch_size=10000, log_level='DEBUG', nats_url='node:4223', nats_store='/tmp/nats/triton-3-demo', nats_debug=False, context_workers=2, generate_workers=1, baseline_
workers=None, api_server_url='http://node:8005', workers_only=False, max_model_len=None, max_num_seqs=-1, context_max_num_seqs=-1, generate_max_num_seqs=-1, benchmark=False, benchmark_timeout=300, isl_cached=0, isl_uncached=20
48, osl=128, load_type='concurrency', load_value=[32], request_count_per_load_value=100, min_request_count=None, data_plane_backend='nccl', enable_chunked_prefill=False, enable_prefix_caching=False, artifact_dir='artifacts', base
line_gpu_memory_utilization=0.9, context_gpu_memory_utilization=0.9, generate_gpu_memory_utilization=0.5, profile_workers=False, visible_devices=[0, 1, 2, 3, 4, 5, 6, 7], host='node')
launch_workers.py: INFO: main(): 649:   Example root: /
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_ID=0 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9 --initialize-request-plane
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=2,3 VLLM_WORKER_ID=1 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 416:        Launching worker with CUDA_VISIBLE_DEVICES=4,5 VLLM_WORKER_ID=2 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --request-plane-uri node:4223 --context-
worker-count 0 --generate-worker-count 1 --max-batch-size 10000 --gpu-memory-utilization 0.5
launch_workers.py: INFO: _launch_api_server(): 126:     Launching API server: python3 -m llm.api_server --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --request-plane-uri node:4223 --api-server-host node --api-serve
r-port 8005 --model-name llama
launch_workers.py: INFO: main(): 683:   waiting for <Popen: returncode: None args: ['python3', '-m', 'llm.vllm.deploy', '--model...>
WARNING 01-28 13:38:20 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:20 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
13:38:21 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b5f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='llama',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmContextOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 1,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 0,
                                                   'gpu_memory_utilization': 0.9,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': True,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='llama.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='llama.0' parent=2971261 initial>

Workers started ... press Ctrl-C to Exit
[2972184] 2025/01/28 13:38:21.925873 [INF] Starting nats-server
[2972184] 2025/01/28 13:38:21.926005 [INF]   Version:  2.10.24
[2972184] 2025/01/28 13:38:21.926007 [INF]   Git:      [1d6f7ea]
[2972184] 2025/01/28 13:38:21.926009 [INF]   Name:     NDCCC54S4EE7UKEYVQGVDOTZCRAOSHOHW43SML5IIEXDSXLHULSP66DX
[2972184] 2025/01/28 13:38:21.926013 [INF]   Node:     09ObDJPK
[2972184] 2025/01/28 13:38:21.926014 [INF]   ID:       NDCCC54S4EE7UKEYVQGVDOTZCRAOSHOHW43SML5IIEXDSXLHULSP66DX
[2972184] 2025/01/28 13:38:21.926443 [INF] Starting JetStream
[2972184] 2025/01/28 13:38:21.926579 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[2972184] 2025/01/28 13:38:21.926583 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[2972184] 2025/01/28 13:38:21.926604 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[2972184] 2025/01/28 13:38:21.926606 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[2972184] 2025/01/28 13:38:21.926609 [INF]
[2972184] 2025/01/28 13:38:21.926611 [INF]          https://docs.nats.io/jetstream
[2972184] 2025/01/28 13:38:21.926614 [INF]
[2972184] 2025/01/28 13:38:21.926616 [INF] ---------------- JETSTREAM ----------------
[2972184] 2025/01/28 13:38:21.926619 [INF]   Max Memory:      1.48 TB
[2972184] 2025/01/28 13:38:21.926622 [INF]   Max Storage:     755.84 GB
[2972184] 2025/01/28 13:38:21.926624 [INF]   Store Directory: "/tmp/nats_store/jetstream"
[2972184] 2025/01/28 13:38:21.926626 [INF] -------------------------------------------
[2972184] 2025/01/28 13:38:21.928631 [INF] Listening for client connections on 0.0.0.0:4223
[2972184] 2025/01/28 13:38:21.930827 [INF] Server is ready
13:38:22 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b5f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='llama',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmContextOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 1,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 0,
                                                   'gpu_memory_utilization': 0.9,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': False,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='llama.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='llama.0' parent=2971392 initial>

Workers started ... press Ctrl-C to Exit
WARNING 01-28 13:38:22 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
Namespace(api_server_host='node', api_server_port=8005, request_plane_uri='node:4223', data_plane_host=None, data_plane_port=0, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_level=1, progra
m_name='OpenAI API Sever')
13:38:23 __main__.py:28[OpenAI API Sever] INFO: Starting
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO:     Started server process [2971700]
INFO:     Waiting for application startup.
TRACE:    ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
TRACE:    ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE:    ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO:     Application startup complete.
INFO:     Uvicorn running on http://node:8005 (Press CTRL+C to quit)
13:38:24 deployment.py:116[triton_distributed.worker.deployment] INFO:

Starting Worker:

        Config:
        WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7fff175b1f80>,
             request_plane_args=([], {'request_plane_uri': 'node:4223'}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='generate',
                                       implementation=<class 'llm.vllm.operators.vllm.VllmGenerateOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=1000,
                                       parameters={'baseline_tp_size': 1,
                                                   'baseline_worker_count': 0,
                                                   'context_tp_size': 1,
                                                   'context_worker_count': 0,
                                                   'disable_async_output_proc': True,
                                                   'disable_log_stats': True,
                                                   'dtype': 'auto',
                                                   'dummy_worker_count': 0,
                                                   'enable_chunked_prefill': False,
                                                   'enable_prefix_caching': False,
                                                   'enforce_eager': False,
                                                   'generate_tp_size': 1,
                                                   'generate_worker_count': 1,
                                                   'gpu_memory_utilization': 0.5,
                                                   'ignore_eos': False,
                                                   'initialize_request_plane': False,
                                                   'kv_cache_dtype': 'fp8',
                                                   'log_dir': '',
                                                   'log_level': 1,
                                                   'max_batch_size': 10000,
                                                   'max_model_len': None,
                                                   'max_num_seqs': None,
                                                   'model_name': 'neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8',
                                                   'nats_url': 'nats://localhost:4223',
                                                   'request_plane_uri': 'node:4223',
                                                   'starting_metrics_port': 0,
                                                   'worker_name': 'llama'},
                                       log_level=None)],
             triton_log_path=None,
             name='generate.0',
             log_dir='',
             metrics_port=0)
        <SpawnProcess name='generate.0' parent=2971568 initial>

Workers started ... press Ctrl-C to Exit
WARNING 01-28 13:38:26 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:26 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-28 13:38:30 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause
errors. See https://pypi.org/project/pynvml for more information.
INFO 01-28 13:38:33 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-28 13:38:33 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 01-28 13:38:33 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
WARNING 01-28 13:38:33 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
INFO 01-28 13:38:33 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_p
ath=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fa
lse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached
_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:33 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_p
ath=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fa
lse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached
_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:33 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:33 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:34 parallel_state.py:939] ==================================================
INFO 01-28 13:38:34 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:34 parallel_state.py:942] Stage: PREFILL
[W128 13:38:34.115654817 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.115855462 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116029983 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116208041 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116361675 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116533762 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116697932 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.116871732 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117042191 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117200333 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117361510 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
[W128 13:38:34.117524740 socket.cpp:697] [c10d] The client socket has failed to connect to [node.cluster.com]:36183 (errno: 22 - Invalid argument).
INFO 01-28 13:38:34 parallel_state.py:939] ==================================================
INFO 01-28 13:38:34 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:34 parallel_state.py:942] Stage: PREFILL
INFO 01-28 13:38:35 parallel_state.py:958] world_size: 3, rank: 1, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:35 parallel_state.py:960] world_size=3 rank=1 local_rank=0 distributed_init_method=None backend=nccl
INFO 01-28 13:38:36 config.py:653] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 01-28 13:38:36 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider settin
g --max-model-len to a smaller value.
INFO 01-28 13:38:36 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post2.dev16+gf61960ce) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-I
nstruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131
072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_
path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=F
alse), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cache
d_outputs=False, mm_processor_kwargs=None)
INFO 01-28 13:38:37 selector.py:141] Using Flashinfer backend.
INFO 01-28 13:38:37 parallel_state.py:939] ==================================================
INFO 01-28 13:38:37 parallel_state.py:940] Patching init_distributed_environment
INFO 01-28 13:38:37 parallel_state.py:942] Stage: GENERATE
INFO 01-28 13:38:37 parallel_state.py:958] world_size: 6, rank: 4, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:37 parallel_state.py:960] world_size=6 rank=4 local_rank=0 distributed_init_method=None backend=nccl
INFO 01-28 13:38:37 parallel_state.py:958] world_size: 3, rank: 0, distributed_init_method: None, local_rank: 0, backend: nccl
DEBUG 01-28 13:38:37 parallel_state.py:960] world_size=3 rank=0 local_rank=0 distributed_init_method=None backend=nccl

@piotrm-nvidia
Contributor Author

Execution outside the launching script succeeds with the config below:

export VLLM_DATA_PLANE_BACKEND=nccl
export VLLM_TORCH_HOST=localhost
export VLLM_GENERATE_WORKERS=1
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_BASELINE_WORKERS=0
export VLLM_GENERATE_TP_SIZE=2
export VLLM_LOGGING_LEVEL=INFO
export VLLM_CONTEXT_WORKERS=1
export VLLM_BASELINE_TP_SIZE=1
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_WORKER_ID=0 \
python3 -m llm.vllm.deploy \
  --context-worker-count 1 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --worker-name llama \
  --disable-async-output-proc \
  --disable-log-stats \
  --max-model-len 3500 \
  --max-batch-size 10000 \
  --gpu-memory-utilization 0.9 \
  --context-tp-size 2 \
  --generate-tp-size 2 \
  --initialize-request-plane &
CUDA_VISIBLE_DEVICES=2,3 \
VLLM_WORKER_ID=1 \
python3 -m llm.vllm.deploy \
  --generate-worker-count 1 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --worker-name llama \
  --disable-async-output-proc \
  --disable-log-stats \
  --max-model-len 3500 \
  --max-batch-size 10000 \
  --gpu-memory-utilization 0.9 \
  --context-tp-size 2 \
  --generate-tp-size 2 &
python3 -m llm.api_server \
  --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --request-plane-uri ${HOSTNAME}:4223 \
  --api-server-host ${HOSTNAME} \
  --model-name llama \
  --api-server-port 8005 &
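
The main difference seems to be the environment: the manual run exports VLLM_TORCH_HOST, VLLM_TORCH_PORT, and the TP sizes before starting each worker. A minimal sketch of how the launching script could forward the same variables to each spawned worker (hypothetical code, not the actual launch_workers.py implementation; the values mirror the exports above):

import os
import subprocess

def launch_worker(worker_id: int, cuda_devices: str, extra_args: list[str]) -> subprocess.Popen:
    # Copy the parent environment and add the per-worker variables that the
    # working manual configuration above sets explicitly.
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": cuda_devices,
        "VLLM_WORKER_ID": str(worker_id),
        "VLLM_TORCH_HOST": "localhost",
        "VLLM_TORCH_PORT": "36183",
        "VLLM_CONTEXT_TP_SIZE": "2",
        "VLLM_GENERATE_TP_SIZE": "2",
    })
    cmd = ["python3", "-m", "llm.vllm.deploy", *extra_args]
    return subprocess.Popen(cmd, env=env)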

@piotrm-nvidia
Contributor Author

You can test the launching script on a single 8xH100 machine:

python3 launch_workers.py \
    --log-level INFO \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1  \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <YOUR DIR>

It should print genai-perf output tables with performance results and save them in <YOUR DIR>.

piotrm-nvidia requested a review from glos-nv on January 29, 2025 13:10
@piotrm-nvidia
Contributor Author

piotrm-nvidia commented Jan 29, 2025

I executed the sbatch launch script on two nodes of a Slurm cluster:

#!/bin/bash
#SBATCH --partition=<PARTITION>
#SBATCH --account=<ACCOUNT>
#SBATCH --job-name=<JOB>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=2:00:00
#SBATCH --mem=0
#SBATCH --no-kill
#SBATCH --exclusive
#SBATCH --output=<ARTIFACTS>/%x_%j_%n_%N.out              ### Slurm output file, %x is job name, %j is job id
#SBATCH --error=<ARTIFACTS>/%x_%j_%n_%N.err               ### Slurm error file, %x is job name, %j is job id

set -e
set -x

export HF_TOKEN=<TOKEN>

export JOB_DIR=<ARTIFACTS>
export LOGDIR=${JOB_DIR}/logs
export PROFILESDIR=${JOB_DIR}/profiles
export SCHEDULER_FILE=$LOGDIR/scheduler.json
export SCHEDULER_LOG=$LOGDIR/scheduler.log
export DONE_MARKER=$LOGDIR/done.txt

export DEVICE="gpu"

export INTERFACE="eth3"
export PROTOCOL="tcp"

export CPU_WORKER_MEMORY_LIMIT="14GB"
export RAPIDS_NO_INITIALIZE="1"
export CUDF_SPILL="1"
export RMM_SCHEDULER_POOL_SIZE="1GB"
export RMM_WORKER_POOL_SIZE="72GiB"
export LIBCUDF_CUFILE_POLICY=OFF
export DASK_DATAFRAME__QUERY_PLANNING=False

mkdir -p $LOGDIR
mkdir -p $PROFILESDIR

srun --container-mounts=/lustre/fsw/:/lustre/fsw/ --container-image=<CONTAINER> bash -c "python3 launch_workers.py \
    --log-level INFO \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 32 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <ARTIFACTS>"

The log output suggests that the NATS.io host is not propagated correctly to any of the NATS.io-related components.

python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 0 --generate-worker-count 1 --max-batch-size 10000 --gpu-memory-utilization 0.5
launch_workers.py: INFO: main(): 701:   Namespace(model_ckpt='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', model_name='llama', log_dir='/lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/examples/llm/logs', dry_run=False, context_tp_size=2, generate_tp_size=8, baseline_tp_size=1, context_max_batch_size=10000, generate_max_batch_size=10000, baseline_max_batch_size=10000, log_level='INFO', nats_url='nats://node:4223', nats_store='/tmp/nats/triton-3-demo', nats_debug=False, context_workers=4, generate_workers=1, baseline_workers=None, api_server_url='http://node:8005', workers_only=False, max_model_len=None, max_num_seqs=-1, context_max_num_seqs=-1, generate_max_num_seqs=-1, benchmark=True, benchmark_timeout=800, isl_cached=0, isl_uncached=3000, osl=150, load_type='concurrency', load_value=[1, 1, 16, 32], request_count_per_load_value=10, min_request_count=20, data_plane_backend='nccl', enable_chunked_prefill=False, enable_prefix_caching=False, artifact_dir='/lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/art_0129_v2', baseline_gpu_memory_utilization=0.9, context_gpu_memory_utilization=0.9, generate_gpu_memory_utilization=0.5, profile_workers=False, visible_devices=[0, 1, 2, 3, 4, 5, 6, 7], host='node')
launch_workers.py: INFO: main(): 702:   Example root: /lustre/fsw/coreai_tritoninference_triton3/piotrm/src/triton-distributed/examples/llm
launch_workers.py: INFO: _launch_workers(): 320:        Ntasks configuration 2
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_ID=0 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9 --initialize-request-plane
launch_workers.py: INFO: main(): 738:   waiting for <Popen: returncode: None args: ['python3', '-m', 'llm.vllm.deploy', '--model...>
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=2,3 VLLM_WORKER_ID=1 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=4,5 VLLM_WORKER_ID=2 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_workers(): 446:        Launching worker with CUDA_VISIBLE_DEVICES=6,7 VLLM_WORKER_ID=3 :
 python3 -m llm.vllm.deploy --model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8 --dtype auto --worker-name llama --disable-async-output-proc --disable-log-stats --context-tp-size 2 --generate-tp-size 8 --baseline-tp-size 1 --request-plane-uri nats://node:4223 --context-worker-count 1 --generate-worker-count 0 --max-batch-size 10000 --gpu-memory-utilization 0.9
launch_workers.py: INFO: _launch_api_server(): 133:     Launching API server: python3 -m llm.api_server --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --request-plane-uri nats://node:4223 --api-server-host node --api-server-port 8005 --model-name llama
launch_workers.py: WARNING: wait_for_server(): 569:     Server not responding: HTTPConnectionPool(host='node', port=8005): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffff75afad0>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2469896] 2025/01/29 05:25:02.968507 [INF] Starting nats-server
[2469896] 2025/01/29 05:25:02.968632 [INF]   Version:  2.10.24
[2469896] 2025/01/29 05:25:02.968635 [INF]   Git:      [1d6f7ea]
[2469896] 2025/01/29 05:25:02.968638 [INF]   Name:     NAAHV7QMT2RZBP3GB46O4ANWCSFHMB7M264RQCMB2BO2GR2GFY3FAR32
[2469896] 2025/01/29 05:25:02.968644 [INF]   Node:     zBBWatmk
[2469896] 2025/01/29 05:25:02.968645 [INF]   ID:       NAAHV7QMT2RZBP3GB46O4ANWCSFHMB7M264RQCMB2BO2GR2GFY3FAR32
[2469896] 2025/01/29 05:25:02.968909 [INF] Starting JetStream
[2469896] 2025/01/29 05:25:02.969007 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[2469896] 2025/01/29 05:25:02.969009 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[2469896] 2025/01/29 05:25:02.969010 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[2469896] 2025/01/29 05:25:02.969011 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[2469896] 2025/01/29 05:25:02.969012 [INF]
[2469896] 2025/01/29 05:25:02.969013 [INF]          https://docs.nats.io/jetstream
[2469896] 2025/01/29 05:25:02.969014 [INF]
[2469896] 2025/01/29 05:25:02.969014 [INF] ---------------- JETSTREAM ----------------
[2469896] 2025/01/29 05:25:02.969016 [INF]   Max Memory:      1.48 TB
[2469896] 2025/01/29 05:25:02.969018 [INF]   Max Storage:     755.84 GB
[2469896] 2025/01/29 05:25:02.969019 [INF]   Store Directory: "/tmp/nats_store/jetstream"
[2469896] 2025/01/29 05:25:02.969020 [INF] -------------------------------------------
[2469896] 2025/01/29 05:25:02.970594 [INF] Listening for client connections on 0.0.0.0:4223

...

[rank1]:[W129 05:28:40.599990280 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank12]:[W129 05:28:40.644431120 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W129 05:28:40.644566749 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank8]:[W129 05:28:40.699022478 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank9]:[W129 05:28:40.999093155 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank13]:[W129 05:28:40.152316566 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank10]:[W129 05:28:41.636855564 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank14]:[W129 05:28:41.744224705 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank11]:[W129 05:28:41.275960017 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank15]:[W129 05:28:41.314099102 ProcessGroupNCCL.cpp:2892] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1130, in create_connection
    raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('::1', 4223, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 4223)
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1130, in create_connection
    raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('::1', 4223, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 4223)
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server

@piotrm-nvidia
Contributor Author

Benchmark command for context TP=2, DP=2 and generate TP=4, DP=1:

python3 launch_workers.py \
    --log-level INFO \
    --nats-url <HOST>:4223 \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --context-workers 2 \
    --context-tp-size 2 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --max-model-len 3500 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 2 4 8 16 32 64 128 256 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <FOLDER>/art_0129_70b_single_node

Benchmark command for context TP=4, DP=1 and generate TP=4, DP=1:

python3 launch_workers.py \
    --log-level INFO \
    --nats-url <HOST>:4223 \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --context-workers 1 \
    --context-tp-size 4 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --max-model-len 3500 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 2 4 8 16 32 64 128 256 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <FOLDER>/art_0129_70b_single_node
