test_sanity fails #36

Closed

piotrm-nvidia opened this issue Jan 17, 2025 · 5 comments

Comments

@piotrm-nvidia (Contributor)

root@ulalegionbuntu:/workspace/examples/hello_world# pytest tests/test_sanity.py 
/usr/local/lib/python3.12/dist-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================== test session starts ==================================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /workspace
configfile: pyproject.toml
plugins: anyio-4.8.0, mock-3.14.0, timeout-2.3.1, asyncio-0.25.2, benchmark-5.1.0, cov-6.0.0, md-report-0.6.3, mypy-0.10.3
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None
collected 3 items                                                                       

tests/test_sanity.py ..F                                                          [100%]

======================================= FAILURES ========================================
______________________________________ test_sanity ______________________________________

    def test_sanity():
        deployment_command = [
            "python3",
            "-m",
            "hello_world.deploy",
            "--initialize-request-plane",
        ]
    
        deployment_process = subprocess.Popen(
            deployment_command,
            stdin=subprocess.DEVNULL,
        )
    
        client_command = [
            "python3",
            "-m",
            "hello_world.client",
            "--requests-per-client",
            "10",
        ]
    
        client_process = subprocess.Popen(
            client_command,
            stdin=subprocess.DEVNULL,
        )
        try:
            client_process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            print("Client timed out!")
            client_process.terminate()
            client_process.wait()
    
        client_process.terminate()
        client_process.kill()
        client_process.wait()
        deployment_process.terminate()
        deployment_process.wait()
>       assert client_process.returncode == 0, "Error in clients!"
E       AssertionError: Error in clients!
E       assert 241 == 0
E        +  where 241 = <Popen: returncode: 241 args: ['python3', '-m', 'hello_world.client', '--req...>.returncode

client_command = ['python3', '-m', 'hello_world.client', '--requests-per-client', '10']
client_process = <Popen: returncode: 241 args: ['python3', '-m', 'hello_world.client', '--req...>
deployment_command = ['python3', '-m', 'hello_world.deploy', '--initialize-request-plane']
deployment_process = <Popen: returncode: 1 args: ['python3', '-m', 'hello_world.deploy', '--initi...>

tests/test_sanity.py:63: AssertionError
--------------------------------- Captured stdout call ----------------------------------
Starting Workers
20:29:13 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7feac4838220>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'},
                                                                             'input_copies': {'string_value': '1'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50000)
	<SpawnProcess name='encoder.0' parent=1172 initial>

20:29:13 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7feac4838220>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='decoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'},
                                                                             'input_copies': {'string_value': '1'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50001)
	<SpawnProcess name='decoder.0' parent=1172 initial>

20:29:13 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7feac4838220>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder_decoder',
                                       implementation='EncodeDecodeOperator',
                                       repository='/workspace/examples/hello_world/operators',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder_decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50002)
	<SpawnProcess name='encoder_decoder.0' parent=1172 initial>

Client timed out!
Stopping Clients
Clients Stopped Exit Code -15
Workers started ... press Ctrl-C to Exit
Stopping Workers
20:29:42 deployment.py:128[triton_distributed.worker.deployment] INFO: 

Stopping Worker:


	<SpawnProcess name='encoder.0' pid=1248 parent=1172 started>

20:29:42 deployment.py:128[triton_distributed.worker.deployment] INFO: 

Stopping Worker:


	<SpawnProcess name='decoder.0' pid=1249 parent=1172 started>

20:29:42 deployment.py:128[triton_distributed.worker.deployment] INFO: 

Stopping Worker:


	<SpawnProcess name='encoder_decoder.0' pid=1253 parent=1172 started>

20:29:46 deployment.py:137[triton_distributed.worker.deployment] INFO: 

Worker Stopped:


	<SpawnProcess name='encoder.0' pid=1248 parent=1172 stopped exitcode=0>

20:29:46 deployment.py:137[triton_distributed.worker.deployment] INFO: 

Worker Stopped:


	<SpawnProcess name='decoder.0' pid=1249 parent=1172 stopped exitcode=0>

20:29:46 deployment.py:137[triton_distributed.worker.deployment] INFO: 

Worker Stopped:


	<SpawnProcess name='encoder_decoder.0' pid=1253 parent=1172 stopped exitcode=1>

Workers Stopped Exit Code 1
--------------------------------- Captured stderr call ----------------------------------
Client: 0:   0%|          | 0/10 [00:00<?, ?request/s]
========================================= mypy ==========================================
Success: no issues found in 1 source file
================================ short test summary info ================================
FAILED tests/test_sanity.py::test_sanity - AssertionError: Error in clients!
============================= 1 failed, 2 passed in 34.29s ==============================

@tanmayv25 (Contributor)

It does pass on my machine.

root@859ebeb-lcedt:/workspace/examples/hello_world# pytest tests/test_sanity.py
/usr/local/lib/python3.12/dist-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================================== test session starts ==============================================================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /workspace
configfile: pyproject.toml
plugins: mock-3.14.0, timeout-2.3.1, benchmark-5.1.0, cov-6.0.0, asyncio-0.25.2, anyio-4.8.0, mypy-0.10.3, md-report-0.6.3
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None
collected 3 items                                                                                                                               

tests/test_sanity.py ...                                                                                                                  [100%]
===================================================================== mypy ======================================================================

Success: no issues found in 1 source file
============================================================== 3 passed in 12.55s ===============================================================
root@859ebeb-lcedt:/workspace/examples/hello_world#

The client seems to be timing out in your runs.
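One hedged way to narrow this down would be to capture the client's output inside the test instead of letting it go only to the console; a minimal sketch, assuming the same hello_world.client invocation used in test_sanity:

import subprocess

client_command = [
    "python3",
    "-m",
    "hello_world.client",
    "--requests-per-client",
    "10",
]

# Capture stdout/stderr so the reason for a non-zero exit code
# (241 in the failing run above) is attached to the pytest report.
client_process = subprocess.Popen(
    client_command,
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
try:
    output, _ = client_process.communicate(timeout=30)
except subprocess.TimeoutExpired:
    client_process.kill()
    output, _ = client_process.communicate()
print(output)
print("client exit code:", client_process.returncode)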

@piotrm-nvidia (Contributor, Author)

UCX fails during endpoint creation with a timeout:

future: <Task finished name='Task-12' coro=<Worker._process_request_task() done, defined at /workspace/worker/src/python/triton_distributed/worker/worker.py:176> exception=DataPlaneError('Error Referencing Tensor:\n{remote_tensor}')>
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 400, in _create_remote_tensor_reference
    endpoint = await self._create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 497, in _create_endpoint
    endpoint = await ucp.create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 1016, in create_endpoint
    return await _get_ctx().create_endpoint(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 328, in create_endpoint
    peer_info = await exchange_peer_info(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 60, in exchange_peer_info
    await asyncio.wait_for(
  File "/usr/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
    async with timeouts.timeout(timeout):
  File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

The above exception was the direct cause of the following exception:

GPU, CUDA, drivers:

root@ulalegionbuntu:/workspace/examples/hello_world# nvidia-smi 
Mon Jan 20 14:07:42 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   41C    P5              20W /  80W |     99MiB /  6144MiB |     21%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@piotrm-nvidia (Contributor, Author) commented Jan 20, 2025

I used the following UCX debugging flags:

export UCX_LOG_LEVEL=DEBUG
export UCXPY_LOG_LEVEL=DEBUG
export UCX_PROTO_INFO=y

With these flags set, the example prints all UCX logs to standard output rather than to the log files created by the Worker code.
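For reference, the same flags can also be passed to the subprocesses spawned by the test itself; a minimal sketch, assuming the deployment command from test_sanity:

import os
import subprocess

# Copy the current environment and add the UCX debug flags so the
# deployment (and client) subprocesses inherit them.
env = dict(os.environ)
env.update({
    "UCX_LOG_LEVEL": "DEBUG",
    "UCXPY_LOG_LEVEL": "DEBUG",
    "UCX_PROTO_INFO": "y",
})

deployment_process = subprocess.Popen(
    ["python3", "-m", "hello_world.deploy", "--initialize-request-plane"],
    stdin=subprocess.DEVNULL,
    env=env,
)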

There is an interesting part in the log:

[1737382432.501579] [ulalegionbuntu:1367 :1]           async.c:231  UCX  DEBUG added async handler 0x733b3c03f5b0 [id=67 ref 1] ???() to hash
[1737382432.501585] [ulalegionbuntu:1367 :1]           async.c:493  UCX  DEBUG listening to async event fd 67 events 0x5 mode thread_spinlock
[1737382432.501590] [ulalegionbuntu:1367 :1]    tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x733b3c009a90 on cm 0x733b3c034c50 with fd: 67 listening on 0.0.0.0:0
[1737382432.506782] [ulalegionbuntu:1367 :1]          ucp_ep.c:404  UCX  DEBUG created ep 0x733c91e49080 to <no debug data> from api call
[1737382432.506877] [ulalegionbuntu:1367 :1]            sock.c:344  UCX  DEBUG connect(fd=68, src_addr=172.17.0.1:34156 dest_addr=172.17.0.1:38467): Operation now in progress
[1737382432.506896] [ulalegionbuntu:1367 :1]           async.c:231  UCX  DEBUG added async handler 0x733b3c041370 [id=68 ref 1] uct_tcp_sa_data_handler() to hash
[1737382432.506904] [ulalegionbuntu:1367 :1]           async.c:493  UCX  DEBUG listening to async event fd 68 events 0x2 mode thread_spinlock
[1737382432.506908] [ulalegionbuntu:1367 :1]   tcp_sockcm_ep.c:923  UCX  DEBUG created a TCP SOCKCM endpoint (fd=68) on tcp cm 0x733b3c034c50, remote addr: 172.17.0.1:38467
[1737382432.506910] [ulalegionbuntu:1367 :1]   tcp_sockcm_ep.c:1126 UCX  DEBUG client created an endpoint on tcp_sockcm 0x733b3c034c50 id: 68 state: 2
[1737382432.506914] [ulalegionbuntu:1367 :1]       wireup_ep.c:530  UCX  DEBUG ep 0x733c91e49080: wireup_ep 0x733b3c0407d0 set next_ep 0x733b3c040ad0
[1737382432.506922] [ulalegionbuntu:1367 :a]            sock.c:1043 UCX  DEBUG check ifname for socket on 172.17.0.1:0
[1737382432.507080] [ulalegionbuntu:1367 :a]            sock.c:1061 UCX  DEBUG matching ip found iface on docker0
[1737382432.507093] [ulalegionbuntu:1367 :a]       wireup_cm.c:597  UCX  DIAG  client ep 0x733c91e49080 connect to 172.17.0.1:38467 failed: device docker0 is not enabled, enable it in UCX_NET_DEVICES or use corresponding ip address
[1737382432.507097] [ulalegionbuntu:1367 :a]       wireup_cm.c:128  UCX  DIAG  client ep 0x733c91e49080 failed to connect to 172.17.0.1:38467 using tcp cms
[1737382432.507103] [ulalegionbuntu:1367 :a]          uct_cm.c:97   UCX  DIAG  resolve callback failed with error: Destination is unreachable
[1737382432.507107] [ulalegionbuntu:1367 :a]           async.c:156  UCX  DEBUG removed async handler 0x733b3c041370 [id=68 ref 2] uct_tcp_sa_data_handler() from hash

Full log: ucx_enabled_log.txt

@piotrm-nvidia (Contributor, Author)

The root cause seems to be an interaction between the UCP data plane and UCX in the Docker environment on the test machine:

  1. By default, the Worker class does not pass any data plane hostname or port.
  2. The UCP data plane then falls back to its defaults, which use ucp.get_address(); on this host that returns 0.0.0.0 and a random port.
  3. The UCP server and client end up using 172.17.0.1, which is the docker0 interface, and the connection fails with: wireup_cm.c:597 UCX DIAG client ep 0x733c91e49080 connect to 172.17.0.1:38467 failed: device docker0 is not enabled, enable it in UCX_NET_DEVICES or use corresponding ip address. A hedged workaround is sketched below.
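A possible workaround, sketched here under the assumption that restricting UCX to a routable interface is enough (UCX_NET_DEVICES is the knob the DIAG message above points at; the interface name is a placeholder):

import os

# Placeholder interface name; pick one that is reachable from both
# the UCP server and the client instead of docker0.
os.environ["UCX_NET_DEVICES"] = "eth0"

# Import after setting the environment so UCX picks the setting up.
import ucp

# Address UCX-Py would advertise by default on this host.
print(ucp.get_address())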

@piotrm-nvidia (Contributor, Author)

Duplicate of #39

piotrm-nvidia marked this as a duplicate of #39 on Jan 21, 2025