Deadlock in hello world example with UCX error #39

Open
piotrm-nvidia opened this issue Jan 17, 2025 · 1 comment

piotrm-nvidia commented Jan 17, 2025

JIRA: DLIS-7830

The single_file.py example fails with a UCXUnreachable error:

ucp._libs.exceptions.UCXUnreachable: <stream_recv>:

The example's default configuration of the network interface used for UCX does not work on this machine, which is running Docker.

Reproduction

Use branch https://github.com/triton-inference-server/triton-distributed/tree/piotrm-add-nats-hosts

Start the NATS.io server:

nats-server -js

Start the example, passing the default host and port:

python3 single_file.py  --request-plane-uri htp://127.0.0.1:4222

Expected result

The example sends requests, processes them in the workers, and returns without error.

Results

Log indicates no request was processed:

Starting Workers
22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50000)
	<SpawnProcess name='encoder.0' parent=2466 initial>

22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='decoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_GPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50001)
	<SpawnProcess name='decoder.0' parent=2466 initial>

22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder_decoder',
                                       implementation=<class '__main__.EncodeDecodeOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=100,
                                       parameters={},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder_decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50002)
	<SpawnProcess name='encoder_decoder.0' parent=2466 initial>

Sending Requests
Sending Requests:   0%| 

Error logs analysis

The most important logs are not printed to the console but written to several log files:

-rw-r--r-- 1 root root    198 Jan 17 22:01 decoder.0.ab1dbe5f-d51e-11ef-8242-88a4c2b6c3a5.2507.stderr.log
-rw-r--r-- 1 root root    134 Jan 17 22:01 decoder.0.ab1dbe5f-d51e-11ef-8242-88a4c2b6c3a5.2507.stdout.log
-rw-r--r-- 1 root root  11818 Jan 17 22:02 decoder.0.ab1dbe5f-d51e-11ef-8242-88a4c2b6c3a5.2507.triton.log
-rw-r--r-- 1 root root    316 Jan 17 22:01 encoder.0.ab202624-d51e-11ef-acce-88a4c2b6c3a5.2506.stderr.log
-rw-r--r-- 1 root root    134 Jan 17 22:01 encoder.0.ab202624-d51e-11ef-acce-88a4c2b6c3a5.2506.stdout.log
-rw-r--r-- 1 root root  12072 Jan 17 22:02 encoder.0.ab202624-d51e-11ef-acce-88a4c2b6c3a5.2506.triton.log
-rw-r--r-- 1 root root 118938 Jan 17 22:03 encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stderr.log
-rw-r--r-- 1 root root  32208 Jan 17 22:03 encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stdout.log
-rw-r--r-- 1 root root   4477 Jan 17 22:02 encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.triton.log
-rw-r--r-- 1 root root   1570 Jan 17 22:01 nats_server.stderr.log
-rw-r--r-- 1 root root      0 Jan 17 22:01 nats_server.stdout.log

All logs zip: logs.zip

It is necessary to inspect all of them to identify the root cause of the failure.
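
Since the relevant messages are spread across many files, a small script can help locate them. The following is a minimal sketch, not part of the project: the function name `find_error_lines` and the marker strings are assumptions chosen for illustration. It lists every line in `*.log` files under a directory that contains one of the markers.

```python
from pathlib import Path

# Marker strings are assumptions for illustration; extend them as needed.
ERROR_MARKERS = ("Traceback (most recent call last)", "Error", "Exception", "UCX")


def find_error_lines(log_dir, markers=ERROR_MARKERS):
    """Return {file name: [(line number, text), ...]} for lines containing a marker."""
    hits = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        lines = path.read_text(errors="replace").splitlines()
        matches = [
            (number, line.rstrip())
            for number, line in enumerate(lines, start=1)
            if any(marker in line for marker in markers)
        ]
        if matches:
            hits[path.name] = matches
    return hits
```

For this report it could be called as `find_error_lines("/workspace/examples/hello_world/logs")`, which is the `log_dir` shown in the worker configuration above.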

Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stdout.log:

22:03:14 single_file.py:55[OPERATOR('encoder_decoder', 1)] INFO: got request!

Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stderr.log:

future: <Task finished name='Task-21' coro=<Worker._process_request_task() done, defined at /workspace/worker/src/python/triton_distributed/worker/worker.py:176> exception=DataPlaneError('Error Referencing Tensor:\n{remote_tensor}')>
Traceback (most recent call last):
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 400, in _create_remote_tensor_reference
    endpoint = await self._create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 497, in _create_endpoint
    endpoint = await ucp.create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 1016, in create_endpoint
    return await _get_ctx().create_endpoint(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 328, in create_endpoint
    peer_info = await exchange_peer_info(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 60, in exchange_peer_info
    await asyncio.wait_for(
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
ucp._libs.exceptions.UCXUnreachable: <stream_recv>:

The above exception was the direct cause of other exceptions.

Network configuration of the docker instance

Network configuration in docker:

root@ulalegionbuntu:/workspace/examples/hello_world/logs# ifconfig 
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:e8:c4:ef:de  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 88:a4:c2:b6:c3:a5  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3155  bytes 853385 (853.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3155  bytes 853385 (853.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.9.193  netmask 255.255.252.0  broadcast 192.168.11.255
        inet6 fe80::86a5:f982:b461:e36d  prefixlen 64  scopeid 0x20<link>
        ether c0:3c:59:4c:02:c0  txqueuelen 1000  (Ethernet)
        RX packets 33102  bytes 12013650 (12.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5005  bytes 1052106 (1.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Python workers analysis

All Python threads are idle:

root@ulalegionbuntu:/workspace/examples/hello_world# ps aux | grep python
root         648  0.0  0.0  17048 12416 pts/0    S    21:52   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root         652  0.6  1.5 10636956 451144 pts/0 Sl   21:52   0:04 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        1127  0.0  0.0  17048 12544 pts/0    S    21:52   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        1133  0.6  1.5 10636960 450712 pts/0 Sl   21:52   0:03 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        1609  0.0  0.0  17048 12288 pts/0    S    21:59   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        1610  1.2  2.5 9600068 719432 pts/0  Sl   21:59   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
root        2015  0.0  0.0  17048 12544 pts/0    S    22:00   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        2021  1.6  1.5 10637080 450428 pts/0 Sl   22:00   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        2466  3.8  0.9 9539144 282900 pts/0  Sl   22:01   0:03 python3 single_file.py --request-plane-uri htp://127.0.0.1:4222
root        2505  0.0  0.0  17048 12544 pts/0    S    22:01   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        2506  2.9  2.5 13794504 719692 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
root        2507  3.3  2.5 13794632 724220 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=23) --multiprocessing-fork
root        2511  3.0  2.5 11235536 716496 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        2776  2.4  0.7 4448284 219448 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
root        2789  2.3  0.7 4448280 219508 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
root        2963  0.0  0.0   3532  1792 pts/0    S+   22:03   0:00 grep --color=auto python


root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2466
Process 2466: python3 single_file.py --request-plane-uri htp://127.0.0.1:4222
Python v3.12.3 (/usr/bin/python3.12)

Thread 2466 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    <module> (hello_world/single_file.py:215)
Thread 2514 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2515 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2518 (idle): "Thread-2"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# ps aux | grep triton
root        2776  0.8  0.7 4448284 219448 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
root        2789  0.7  0.7 4448280 219508 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
root        2967  0.0  0.0   3532  1792 pts/0    S+   22:05   0:00 grep --color=auto triton
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 648 
Process 648: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 648 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 652
Process 652: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 652 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1127
Process 1127: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 1127 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1133
Process 1133: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 1133 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1609
Process 1609: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 1609 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1610
Process 1610: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 1610 (idle): "MainThread"
    load (tritonserver/_api/_server.py:931)
    __init__ (worker/triton_core_operator.py:89)
    _import_operators (worker/worker.py:146)
    serve (worker/worker.py:256)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 1762 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 1768 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2015
Process 2015: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 2015 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2021
Process 2021: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2021 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2505
Process 2505: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 2505 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2506
Process 2506: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2506 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:376)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2655 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2663 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2507
Process 2507: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=23) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2507 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2656 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2661 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2511
Process 2511: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2511 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    create_input_tensor_reference (icp/ucp_data_plane.py:461)
    _set_model_infer_request_inputs (worker/remote_request.py:78)
    to_model_infer_request (worker/remote_request.py:199)
    async_infer (worker/remote_operator.py:108)
    execute (hello_world/single_file.py:56)
    _process_request (worker/worker.py:172)
    _process_request_task (worker/worker.py:188)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2654 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2662 (idle): "Thread-1 (_run_event_loop)"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
    __enter__ (contextlib.py:132)
    inner (contextlib.py:80)
    _arm_worker (ucp/continuous_ucx_progress.py:98)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2776
Process 2776: /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
Python v3.12.3 (/opt/tritonserver/backends/python/triton_python_backend_stub)

Thread 2776 (idle): "MainThread"
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2789
Process 2789: /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
Python v3.12.3 (/opt/tritonserver/backends/python/triton_python_backend_stub)

Thread 2789 (idle): "MainThread"
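
One possible reading of the dump for PID 2511: the thread "Thread-1 (_run_event_loop)" runs the data plane event loop, and inside a loop callback `__del__` calls `release_tensor`, which blocks in `result()`. If that result is produced by a coroutine scheduled on the same loop, the wait can never finish, because the loop thread is the one waiting. The sketch below reproduces that pattern in isolation and shows one way to guard against it. It assumes `release_tensor` schedules work with `asyncio.run_coroutine_threadsafe`, which the dump does not confirm; `LoopThread` and `call_sync` are hypothetical names, not project code.

```python
import asyncio
import threading


class LoopThread:
    """Runs an asyncio event loop in a background thread (hypothetical stand-in)."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def on_loop_thread(self):
        return threading.current_thread() is self.thread

    def call_sync(self, coro_fn, timeout=5.0):
        """Run coro_fn() on the loop; wait for the result only from other threads."""
        if self.on_loop_thread():
            # Blocking here would stall the loop that has to run the coroutine,
            # so schedule it and return without waiting.
            self.loop.create_task(coro_fn())
            return None
        future = asyncio.run_coroutine_threadsafe(coro_fn(), self.loop)
        return future.result(timeout=timeout)

    def stop(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()
```

Without the `on_loop_thread()` check, a call made from a loop callback (for example from a `__del__` triggered by garbage collection) would wait on a future that only the blocked thread can complete. Passing a timeout to `result()` at least turns a silent hang into an exception.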

piotrm-nvidia commented

See the analysis in "test_sanity fails" (#36). The root cause is the same.

The root cause seems to be the interaction between UCP and UCX in the Docker environment on the test machine.

  1. By default, the Worker class does not pass any data plane hostname or port.
  2. The UCP data plane defaults are used instead. They call ucp.get_address(), which on this host returns 0.0.0.0 and a random port.
  3. The UCP server and client actually use 172.17.0.1, which is the docker0 interface, and the connection fails with the error message: wireup_cm.c:597 UCX DIAG client ep 0x733c91e49080 connect to 172.17.0.1:38467 failed: device docker0 is not enabled, enable it in UCX_NET_DEVICES or use corresponding ip address
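
The error message points at `UCX_NET_DEVICES`. As a rough sketch of one workaround, the snippet below picks the interfaces that are flagged RUNNING and have an IPv4 address from `ifconfig`-style output, then exports them in that variable. The helper `running_ipv4_interfaces` is hypothetical, the sample text is abbreviated from the `ifconfig` output in this issue, and whether this setting resolves the failure here is untested.

```python
import os
import re


def running_ipv4_interfaces(ifconfig_text):
    """Return names of interfaces flagged RUNNING that have an IPv4 address."""
    names = []
    # ifconfig separates interface stanzas with blank lines.
    for stanza in re.split(r"\n\s*\n", ifconfig_text.strip()):
        header = re.match(r"(\S+?): flags=\d+<([^>]*)>", stanza)
        if not header:
            continue
        name, flags = header.group(1), header.group(2).split(",")
        has_ipv4 = re.search(r"^\s*inet \d", stanza, re.MULTILINE)
        if "RUNNING" in flags and has_ipv4:
            names.append(name)
    return names


# Abbreviated from the ifconfig output in this issue.
SAMPLE = """\
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255

eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 88:a4:c2:b6:c3:a5  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0

wlp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.9.193  netmask 255.255.252.0  broadcast 192.168.11.255
"""

devices = running_ipv4_interfaces(SAMPLE)  # ['lo', 'wlp4s0']

# Assumption: the variable must be set before UCX is initialized,
# for example before `import ucp` in each worker process.
os.environ["UCX_NET_DEVICES"] = ",".join(devices)
```

Note that docker0 is excluded because its flags do not include RUNNING, which is consistent with UCX reporting the device as not enabled. The alternative is to pass an explicit data plane hostname and port to the worker so the 0.0.0.0 default is never used.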
