In a k8s Kuscia point-to-point (P2P) runp cluster, a job created via the API fails on the split task #373

Open
wangzeyu135798 opened this issue Jul 10, 2024 · 7 comments

Comments

@wangzeyu135798

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

centos7

Kuscia Version

kuscia0.8

Deployment

k8s

deployment Version

1.19

App Running type

secretflow

App Running version

1.7

Configuration file used to run kuscia.

1

What happened and what you expected to happen.

# Example: run inside the container
export CTR_CERTS_ROOT=/home/kuscia/var/certs
curl -k -X POST 'http://localhost:8082/api/v1/job/create' \
 --header 'Content-Type: application/json' \
 --cert ${CTR_CERTS_ROOT}/kusciaapi-server.crt \
 --key ${CTR_CERTS_ROOT}/kusciaapi-server.key \
 --cacert ${CTR_CERTS_ROOT}/ca.crt \
 -d '{
  "job_id": "job-best-effort-linear",
  "initiator": "alice",
  "max_parallelism": 2,
  "tasks": [
    {
      "task_id": "job-psi-1",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-psi-1",
      "dependencies": [],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"psi\",\"version\":\"0.0.5\",\"attr_paths\":[\"protocol\",\"sort_result\",\"allow_duplicate_keys\",\"allow_duplicate_keys/yes/join_type\",\"allow_duplicate_keys/yes/join_type/left_join/left_side\",\"input/receiver_input/key\",\"input/sender_input/key\"],\"attrs\":[{\"s\":\"PROTOCOL_ECDH\"},{\"b\":true},{\"s\":\"yes\"},{\"s\":\"left_join\"},{\"ss\":[\"alice\"]},{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]}]},\"sf_input_ids\":[\"alice-table\",\"bob-table\"],\"sf_output_ids\":[\"psi-output-1\"],\"sf_output_uris\":[\"psi-output-1.csv\"]}",
      "priority": 100
    },
    {
      "task_id": "job-split-1",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-split-1",
      "dependencies": [
        "job-psi-1"
      ],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"train_test_split\",\"version\":\"0.0.1\",\"attr_paths\":[\"train_size\",\"test_size\",\"random_state\",\"shuffle\"],\"attrs\":[{\"f\":0.75},{\"f\":0.25},{\"i64\":1234},{\"b\":true}]},\"sf_output_uris\":[\"train-dataset-1.csv\",\"test-dataset-1.csv\"],\"sf_output_ids\":[\"train-dataset-1\",\"test-dataset-1\"],\"sf_input_ids\":[\"psi-output-1\"]}",
      "priority": 100
    }
  ]
}'

In k8s runp mode, calling the split API directly fails with an error.
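The `task_input_config` field in the request above is a JSON document serialized as a string, and each device `config` inside it is again a JSON string, which is why the payload carries two levels of backslash escaping. A minimal sketch, assuming only the Python standard library, of building that nesting with `json.dumps` instead of hand-written escapes. The helper name `build_split_task` is hypothetical, and only a subset of the fields from the request is reproduced (the `heu` device and `link_desc` are left out for brevity).

```python
import json


def build_split_task(task_id, depends_on, input_id):
    """Build one task entry; nested configs are serialized with json.dumps."""
    # Innermost level: the SPU device config is itself a JSON string.
    spu_config = {"runtime_config": {"protocol": "REF2K", "field": "FM64"}}

    task_input = {
        "sf_datasource_config": {
            "alice": {"id": "default-data-source"},
            "bob": {"id": "default-data-source"},
        },
        "sf_cluster_desc": {
            "parties": ["alice", "bob"],
            "devices": [
                {
                    "name": "spu",
                    "type": "spu",
                    "parties": ["alice", "bob"],
                    "config": json.dumps(spu_config),
                }
            ],
            "ray_fed_config": {"cross_silo_comm_backend": "brpc_link"},
        },
        "sf_node_eval_param": {
            "domain": "data_prep",
            "name": "train_test_split",
            "version": "0.0.1",
            "attr_paths": ["train_size", "test_size", "random_state", "shuffle"],
            "attrs": [{"f": 0.75}, {"f": 0.25}, {"i64": 1234}, {"b": True}],
        },
        "sf_input_ids": [input_id],
        "sf_output_ids": ["train-dataset-1", "test-dataset-1"],
        "sf_output_uris": ["train-dataset-1.csv", "test-dataset-1.csv"],
    }

    return {
        "task_id": task_id,
        "app_image": "secretflow-image",
        "parties": [
            {"domain_id": "alice", "role": "partner"},
            {"domain_id": "bob", "role": "partner"},
        ],
        "alias": task_id,
        "dependencies": [depends_on],
        # Middle level: the whole task input is passed as one JSON string.
        "task_input_config": json.dumps(task_input),
        "priority": 100,
    }


task = build_split_task("job-split-1", "job-psi-1", "psi-output-1")
request_body = json.dumps(
    {
        "job_id": "job-best-effort-linear",
        "initiator": "alice",
        "max_parallelism": 2,
        "tasks": [task],
    }
)
print(request_body)
```

The printed string could be passed to `curl -d` in place of the hand-escaped literal, which makes it harder to break the nesting when a field such as `sf_input_ids` is edited.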

Kuscia log output.

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-bfia-parties: ""
    kuscia.secretflow/interconn-kuscia-parties: bob
    kuscia.secretflow/interconn-self-parties: bob
    kuscia.secretflow/job-id: job-best-effort-linear
    kuscia.secretflow/party-master-domain: bob
    kuscia.secretflow/self-cluster-as-initiator: "false"
    kuscia.secretflow/task-alias: job-split-1
  creationTimestamp: "2024-07-10T07:26:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: c74e1f88-5b96-4d44-9480-626a251be466
  name: job-split-1
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-best-effort-linear
    uid: c74e1f88-5b96-4d44-9480-626a251be466
  resourceVersion: "267961"
  uid: 7a0a7ddd-524c-4061-8f5b-d67a97ad7f3d
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: alice
    role: partner
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: bob
    role: partner
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
    \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"train_test_split","version":"0.0.1","attr_paths":["train_size","test_size","random_state","shuffle"],"attrs":[{"f":0.75},{"f":0.25},{"i64":1234},{"b":true}]},"sf_output_uris":["train-dataset-1.csv","test-dataset-1.csv"],"sf_output_ids":["train-dataset-1","test-dataset-1"],"sf_input_ids":["psi-output-1"]}'
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      job-split-1-partner-0/client-server: 31317
      job-split-1-partner-0/fed: 31313
      job-split-1-partner-0/global: 31314
      job-split-1-partner-0/node-manager: 31315
      job-split-1-partner-0/object-manager: 31316
      job-split-1-partner-0/spu: 31318
    role: partner
  - domainID: bob
    namedPort:
      job-split-1-partner-0/client-server: 21793
      job-split-1-partner-0/fed: 21795
      job-split-1-partner-0/global: 21796
      job-split-1-partner-0/node-manager: 21797
      job-split-1-partner-0/object-manager: 21792
      job-split-1-partner-0/spu: 21794
    role: partner
  completionTime: "2024-07-10T07:26:56Z"
  conditions:
  - lastTransitionTime: "2024-07-10T07:26:37Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-07-10T07:26:40Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-07-10T07:26:56Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-07-10T07:26:56Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice-partner],
    successful party[], failed party[bob-partner]
  partyTaskStatus:
  - domainID: alice
    phase: Failed
    role: partner
  - domainID: bob
    phase: Failed
    role: partner
  phase: Failed
  podStatuses:
    bob/job-split-1-partner-0:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      nodeName: kuscia-autonomy-bob-69b9749d87-dpcql
      podName: job-split-1-partner-0
      podPhase: Failed
      readyTime: "2024-07-10T07:26:40Z"
      reason: Error
      startTime: "2024-07-10T07:26:40Z"
      terminationLog: 'container[secretflow] terminated state reason "Error", message:
        "... Ignore 12450 characters at the beginning ...\n)\x1b[0m 2024-07-10 15:26:48.319
        INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {''proxy_max_restarts'':
        3, ''timeout_in_ms'': 300000, ''recv_timeout_ms'': 604800000, ''connect_retry_times'':
        3600, ''connect_retry_interval_ms'': 1000, ''brpc_channel_protocol'': ''http'',
        ''brpc_channel_connection_type'': ''pooled'', ''exit_on_sending_failure'':
        True}\n\x1b[36m(SenderReceiverProxyActor pid=22634)\x1b[0m I0710 15:26:48.328158
        22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl]
        is serving on port=21795.\n\x1b[36m(SenderReceiverProxyActor pid=22634)\x1b[0m
        W0710 15:26:48.328190 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1187]
        Builtin services are disabled according to ServerOptions.has_builtin_services\n\x1b[36m(SenderReceiverProxyActor
        pid=22634)\x1b[0m I0710 15:26:49.969624 22690 external/com_github_brpc_brpc/src/brpc/span.cpp:506]
        Opened ./rpc_data/rpcz/20240710.152649.22634/id.db and ./rpc_data/rpcz/20240710.152649.22634/time.db\n2024-07-10
        15:26:52.351 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create
        receiver proxy actor.\n2024-07-10 15:26:52.351 INFO barriers.py:520 [bob]
        -- [Anonymous_job] Try ping [''alice''] at 0 attemp, up to 3600 attemps.\n\x1b[36m(_run
        pid=22355)\x1b[0m WARNING:root:Since the GPL-licensed package `unidecode`
        is not installed, using Python''s `unicodedata` package which yields worse
        results.\n\x1b[33m(raylet)\x1b[0m [2024-07-10 15:26:47,745 I 22634 22634]
        logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL
        to -1\n2024-07-10 15:26:54.246 ERROR component.py:1129 [bob] -- [Anonymous_job]
        eval on domain: \"data_prep\"\nname: \"train_test_split\"\nversion: \"0.0.1\"\nattr_paths:
        \"train_size\"\nattr_paths: \"test_size\"\nattr_paths: \"random_state\"\nattr_paths:
        \"shuffle\"\nattrs {\n  f: 0.75\n}\nattrs {\n  f: 0.25\n}\nattrs {\n  i64:
        1234\n}\nattrs {\n  b: true\n}\ninputs {\n  name: \"psi-output-1.csv\"\n  type:
        \"sf.table.vertical_table\"\n  system_info {\n  }\n  meta {\n    type_url:
        \"type.googleapis.com/secretflow.spec.v1.VerticalTable\"\n    value: \"\\n\\335\\003\\n\\003id1\\022\\003age\\022\\teducation\\022\\007default\\022\\007balance\\022\\007housing\\022\\004loan\\022\\003day\\022\\010duration\\022\\010campaign\\022\\005pdays\\022\\010previous\\022\\017job_blue-collar\\022\\020job_entrepreneur\\022\\rjob_housemaid\\022\\016job_management\\022\\013job_retired\\022\\021job_self-employed\\022\\014job_services\\022\\013job_student\\022\\016job_technician\\022\\016job_unemployed\\022\\020marital_divorced\\022\\017marital_married\\022\\016marital_single\\\"\\003str*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\n\\227\\003\\n\\003id2\\022\\020contact_cellular\\022\\021contact_telephone\\022\\017contact_unknown\\022\\tmonth_apr\\022\\tmonth_aug\\022\\tmonth_dec\\022\\tmonth_feb\\022\\tmonth_jan\\022\\tmonth_jul\\022\\tmonth_jun\\022\\tmonth_mar\\022\\tmonth_may\\022\\tmonth_nov\\022\\tmonth_oct\\022\\tmonth_sep\\022\\020poutcome_failure\\022\\016poutcome_other\\022\\020poutcome_success\\022\\020poutcome_unknown\\022\\001y\\\"\\003str*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\003int\\020\\244M\"\n  }\n  data_refs
        {\n    uri: \"psi-output-1.csv\"\n    party: \"alice\"\n    format: \"csv\"\n  }\n  data_refs
        {\n    uri: \"psi-output-1.csv\"\n    party: \"bob\"\n    format: \"csv\"\n  }\n}\noutput_uris:
        \"train-dataset-1.csv\"\noutput_uris: \"test-dataset-1.csv\"\n failed, error
        <\x1b[36mray::_run()\x1b[39m (pid=22355, ip=job-split-1-partner-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 382, in <lambda>\n    lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 49, in get_file_meta\n    return impl.get_file_meta(remote_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 208, in get_file_meta\n    assert os.path.exists(full_remote_fn)\nAssertionError>\n2024-07-10
        15:26:54.246 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...\n2024-07-10
        15:26:54.246 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.\n2024-07-10
        15:26:54.247 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message
        polling thread[DataSendingQueueThread] to exit.\n2024-07-10 15:26:54.248 INFO
        message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread]
        to exit.\n2024-07-10 15:26:54.248 INFO api.py:384 [bob] -- [Anonymous_job]
        Shutdowned rayfed.\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/runpy.py\",
        line 196, in _run_module_as_main\n    return _run_code(code, main_globals,
        None,\n  File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code,
        run_globals)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 547, in <module>\n    main()\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1078, in main\n    rv = self.invoke(ctx)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File
        \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return
        __callback(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 527, in main\n    res = comp_eval(sf_node_eval_param, storage_config,
        sf_cluster_config)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\",
        line 166, in comp_eval\n    res = comp.eval(\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1131, in eval\n    raise e from None\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1126, in eval\n    ret = self.__eval_callback(ctx=ctx, **kwargs)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/train_test_split.py\",
        line 103, in train_test_split_eval_fn\n    input_df = load_table(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 390, in load_table\n    file_metas = reveal(file_metas)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\",
        line 162, in reveal\n    all_object = sfd.get(all_object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\",
        line 156, in get\n    return fed.get(object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\",
        line 621, in get\n    values = ray.get(ray_refs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\",
        line 22, in auto_init_wrapper\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\",
        line 103, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\",
        line 2624, in get\n    raise value.as_instanceof_cause()\nray.exceptions.RayTaskError(AssertionError):
        \x1b[36mray::_run()\x1b[39m (pid=22355, ip=job-split-1-partner-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 382, in <lambda>\n    lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 49, in get_file_meta\n    return impl.get_file_meta(remote_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 208, in get_file_meta\n    assert os.path.exists(full_remote_fn)\nAssertionError\n"'
  serviceStatuses:
    bob/job-split-1-partner-0-fed:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: fed
      portNumber: 21795
      readyTime: "2024-07-10T07:26:40Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-fed
    bob/job-split-1-partner-0-global:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: global
      portNumber: 21796
      readyTime: "2024-07-10T07:26:40Z"
      scope: Domain
      serviceName: job-split-1-partner-0-global
    bob/job-split-1-partner-0-spu:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: spu
      portNumber: 21794
      readyTime: "2024-07-10T07:26:40Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-spu
  startTime: "2024-07-10T07:26:37Z"
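The `message` field in the status above packs the per-party state into a single sentence. A small sketch, assuming the message keeps the wording shown in this dump (it is not a documented contract), that extracts the party lists with a regular expression. `parse_party_states` is a hypothetical helper, not part of Kuscia.

```python
import re


def parse_party_states(message):
    """Map each state name to the list of parties reported in that state."""
    pattern = r"(pending|running|successful|failed) party\[([^\]]*)\]"
    states = {}
    for state, members in re.findall(pattern, message):
        # An empty bracket pair yields an empty list.
        states[state] = [m.strip() for m in members.split(",") if m.strip()]
    return states


message = (
    "The remaining no-failed party task counts 1 are less than the threshold "
    "2 that meets the conditions for task success. pending party[], "
    "running party[alice-partner], successful party[], failed party[bob-partner]"
)
states = parse_party_states(message)
print(states["failed"])
```

For this task the only party listed as failed is `bob-partner`, which matches the `terminationLog` being present only in bob's pod status.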
@wangzeyu135798
Author

ad1a304f082e2facfa83626043d1795

@zimu-yuxi

Could you share the error from the other party?

@wangzeyu135798
Author

The other party reports no error. Its log is as follows:
[root@kuscia-autonomy-alice-56db7f7ffc-9gzsl logs]# kubectl get kt job-split-1 -oyaml -n cross-domain
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-bfia-parties: ""
    kuscia.secretflow/interconn-kuscia-parties: bob
    kuscia.secretflow/interconn-self-parties: alice
    kuscia.secretflow/job-id: job-best-effort-linear
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: job-split-1
  creationTimestamp: "2024-07-10T07:26:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: bdf0116a-e7f5-4673-981a-8281857c059a
  name: job-split-1
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-best-effort-linear
    uid: bdf0116a-e7f5-4673-981a-8281857c059a
  resourceVersion: "258184"
  uid: 9fabef3c-36b1-48e9-ac91-b26d6d3a562b
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: alice
    role: partner
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: bob
    role: partner
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
    \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"train_test_split","version":"0.0.1","attr_paths":["train_size","test_size","random_state","shuffle"],"attrs":[{"f":0.75},{"f":0.25},{"i64":1234},{"b":true}]},"sf_output_uris":["train-dataset-1.csv","test-dataset-1.csv"],"sf_output_ids":["train-dataset-1","test-dataset-1"],"sf_input_ids":["psi-output-1"]}'
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      job-split-1-partner-0/client-server: 30445
      job-split-1-partner-0/fed: 30447
      job-split-1-partner-0/global: 30442
      job-split-1-partner-0/node-manager: 30443
      job-split-1-partner-0/object-manager: 30444
      job-split-1-partner-0/spu: 30446
    role: partner
  - domainID: bob
    namedPort:
      job-split-1-partner-0/client-server: 22177
      job-split-1-partner-0/fed: 22179
      job-split-1-partner-0/global: 22180
      job-split-1-partner-0/node-manager: 22181
      job-split-1-partner-0/object-manager: 22176
      job-split-1-partner-0/spu: 22178
    role: partner
  completionTime: "2024-07-10T07:26:56Z"
  conditions:
  - lastTransitionTime: "2024-07-10T07:26:38Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-07-10T07:26:41Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-07-10T07:26:56Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-07-10T07:26:56Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice-partner],
    successful party[], failed party[bob-partner]
  partyTaskStatus:
  - domainID: bob
    phase: Failed
    role: partner
  - domainID: alice
    phase: Failed
    role: partner
  phase: Failed
  podStatuses:
    alice/job-split-1-partner-0:
      createTime: "2024-07-10T07:26:37Z"
      namespace: alice
      nodeName: kuscia-autonomy-alice-56db7f7ffc-9gzsl
      podName: job-split-1-partner-0
      podPhase: Failed
      readyTime: "2024-07-10T07:26:40Z"
      startTime: "2024-07-10T07:26:40Z"
  serviceStatuses:
    alice/job-split-1-partner-0-fed:
      createTime: "2024-07-10T07:26:37Z"
      namespace: alice
      portName: fed
      portNumber: 30447
      readyTime: "2024-07-10T07:26:41Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-fed
    alice/job-split-1-partner-0-global:
      createTime: "2024-07-10T07:26:37Z"
      namespace: alice
      portName: global
      portNumber: 30442
      readyTime: "2024-07-10T07:26:41Z"
      scope: Domain
      serviceName: job-split-1-partner-0-global
    alice/job-split-1-partner-0-spu:
      createTime: "2024-07-10T07:26:37Z"
      namespace: alice
      portName: spu
      portNumber: 30446
      readyTime: "2024-07-10T07:26:41Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-spu
  startTime: "2024-07-10T07:26:38Z"

@wangzeyu135798
Author

WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-07-10 15:26:43,016|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='job-split-1-partner-0-global.bob.svc', ray_node_manager_port=21797, ray_object_manager_port=21792, ray_client_server_port=21793, ray_worker_ports=[], ray_gcs_port=21796)
2024-07-10 15:26:43,016|bob|INFO|secretflow|entry.py:start_ray:63| Trying to start ray head node at job-split-1-partner-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=8 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=job-split-1-partner-0-global.bob.svc --port=21796 --node-manager-port=21797 --object-manager-port=21792 --ray-client-server-port=21793
2024-07-10 15:26:46,577|bob|INFO|secretflow|entry.py:start_ray:80| 2024-07-10 15:26:43,625 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-07-10 15:26:43,626 INFO scripts.py:744 -- Local node IP: job-split-1-partner-0-global.bob.svc
2024-07-10 15:26:46,415 SUCC scripts.py:781 -- --------------------
2024-07-10 15:26:46,415 SUCC scripts.py:782 -- Ray runtime started.
2024-07-10 15:26:46,415 SUCC scripts.py:783 -- --------------------
2024-07-10 15:26:46,415 INFO scripts.py:785 -- Next steps
2024-07-10 15:26:46,415 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-07-10 15:26:46,415 INFO scripts.py:791 -- ray start --address='job-split-1-partner-0-global.bob.svc:21796'
2024-07-10 15:26:46,416 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-07-10 15:26:46,416 INFO scripts.py:802 -- import ray
2024-07-10 15:26:46,416 INFO scripts.py:803 -- ray.init(_node_ip_address='job-split-1-partner-0-global.bob.svc')
2024-07-10 15:26:46,416 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-07-10 15:26:46,416 INFO scripts.py:835 -- ray stop
2024-07-10 15:26:46,416 INFO scripts.py:838 -- To view the status of the cluster, use
2024-07-10 15:26:46,416 INFO scripts.py:839 -- ray status

2024-07-10 15:26:46,577|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at job-split-1-partner-0-global.bob.svc.
2024-07-10 15:26:46,578|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param {
"domain": "data_prep",
"name": "train_test_split",
"version": "0.0.1",
"attrPaths": [
"train_size",
"test_size",
"random_state",
"shuffle"
],
"attrs": [
{
"f": 0.75
},
{
"f": 0.25
},
{
"i64": "1234"
},
{
"b": true
}
]
}
2024-07-10 15:26:46,585|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id psi-output-1 to
...........
name: "psi-output-1.csv"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "psi-output-1.csv"
party: "alice"
format: "csv"
}
data_refs {
uri: "psi-output-1.csv"
party: "bob"
format: "csv"
}

....
2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:159|

Secretflow 1.6.0b0
Build time (May 21 2024, 06:18:47) with commit id: ba76e1fe43cf3daa0c91423a660f318810c88030

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:160|

param

domain: "data_prep"
name: "train_test_split"
version: "0.0.1"
attr_paths: "train_size"
attr_paths: "test_size"
attr_paths: "random_state"
attr_paths: "shuffle"
attrs {
f: 0.75
}
attrs {
f: 0.25
}
attrs {
i64: 1234
}
attrs {
b: true
}
inputs {
name: "psi-output-1.csv"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "psi-output-1.csv"
party: "alice"
format: "csv"
}
data_refs {
uri: "psi-output-1.csv"
party: "bob"
format: "csv"
}
}
output_uris: "train-dataset-1.csv"
output_uris: "test-dataset-1.csv"

--

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:161|

storage_config

type: "local_fs"
local_fs {
wd: "/home/kuscia/var/storage/data"
}

--
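The traceback in bob's `terminationLog` stops at `assert os.path.exists(full_remote_fn)` inside `get_file_meta`, and the storage config printed just above shows a `local_fs` backend with `wd: "/home/kuscia/var/storage/data"`. A minimal sketch for checking, from inside the party's container, whether the input file is actually present. It assumes the `local_fs` backend resolves a data reference URI relative to `wd`, which the log does not confirm, and `missing_inputs` is a hypothetical helper.

```python
import os


def missing_inputs(wd, uris):
    """Return the URIs that do not exist under the working directory wd."""
    return [uri for uri in uris if not os.path.exists(os.path.join(wd, uri))]


# Values taken from the log: storage_config.local_fs.wd and the data_refs uri.
wd = "/home/kuscia/var/storage/data"
print(missing_inputs(wd, ["psi-output-1.csv"]))
```

A non-empty result on bob would be consistent with the `AssertionError`, that is, the PSI output is registered as domain data but the file is not at the expected location on that side.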

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:162|

cluster_config

desc {
parties: "alice"
parties: "bob"
devices {
name: "spu"
type: "spu"
parties: "alice"
parties: "bob"
config: "{"runtime_config":{"protocol":"REF2K","field":"FM64"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
}
devices {
name: "heu"
type: "heu"
parties: "alice"
parties: "bob"
config: "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
}
ray_fed_config {
cross_silo_comm_backend: "brpc_link"
}
}
public_config {
ray_fed_config {
parties: "alice"
parties: "bob"
addresses: "job-split-1-partner-0-fed.alice.svc:80"
addresses: "0.0.0.0:21795"
}
spu_configs {
name: "spu"
parties: "alice"
parties: "bob"
addresses: "http://job-split-1-partner-0-spu.alice.svc:80"
addresses: "0.0.0.0:21794"
}
}
private_config {
self_party: "bob"
ray_head_addr: "job-split-1-partner-0-global.bob.svc:21796"
}

--

2024-07-10 15:26:46,586|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-07-10 15:26:46,586 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: job-split-1-partner-0-global.bob.svc:21796...
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377488 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377488 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,595|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377536 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377536 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377536 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377536 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377440 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377440 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377440 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377440 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377632 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377632 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377632 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377632 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377488 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377488 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597 INFO worker.py:1724 -- Connected to Ray cluster.
2024-07-10 15:26:47.306 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': 'http://job-split-1-partner-0-fed.alice.svc:80', 'bob': '0.0.0.0:21795'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-07-10 15:26:47,275 I 22355 22355] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=22634) 2024-07-10 15:26:48.319 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=22634) I0710 15:26:48.328158 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=21795.
(SenderReceiverProxyActor pid=22634) W0710 15:26:48.328190 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=22634) I0710 15:26:49.969624 22690 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.152649.22634/id.db and ./rpc_data/rpcz/20240710.152649.22634/time.db
2024-07-10 15:26:52.351 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-07-10 15:26:52.351 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
(_run pid=22355) WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
(raylet) [2024-07-10 15:26:47,745 I 22634 22634] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
2024-07-10 15:26:54.246 ERROR component.py:1129 [bob] -- [Anonymous_job] eval on domain: "data_prep"
name: "train_test_split"
version: "0.0.1"
attr_paths: "train_size"
attr_paths: "test_size"
attr_paths: "random_state"
attr_paths: "shuffle"
attrs {
f: 0.75
}
attrs {
f: 0.25
}
attrs {
i64: 1234
}
attrs {
b: true
}
inputs {
name: "psi-output-1.csv"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "psi-output-1.csv"
party: "alice"
format: "csv"
}
data_refs {
uri: "psi-output-1.csv"
party: "bob"
format: "csv"
}
}
output_uris: "train-dataset-1.csv"
output_uris: "test-dataset-1.csv"
failed, error <ray::_run() (pid=22355, ip=job-split-1-partner-0-global.bob.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 382, in <lambda>
lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 49, in get_file_meta
return impl.get_file_meta(remote_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 208, in get_file_meta
assert os.path.exists(full_remote_fn)
AssertionError>
2024-07-10 15:26:54.246 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-07-10 15:26:54.246 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.
2024-07-10 15:26:54.247 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-07-10 15:26:54.248 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-07-10 15:26:54.248 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 166, in comp_eval
res = comp.eval(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1131, in eval
raise e from None
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1126, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/train_test_split.py", line 103, in train_test_split_eval_fn
input_df = load_table(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 390, in load_table
file_metas = reveal(file_metas)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
all_object = sfd.get(all_object_refs)
File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
return fed.get(object_refs)
File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
values = ray.get(ray_refs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=22355, ip=job-split-1-partner-0-global.bob.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 382, in <lambda>
lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 49, in get_file_meta
return impl.get_file_meta(remote_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 208, in get_file_meta
assert os.path.exists(full_remote_fn)
AssertionError

@wangzeyu135798
Author

WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-07-10 15:26:43,898|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='job-split-1-partner-0-global.alice.svc', ray_node_manager_port=25098, ray_object_manager_port=25099, ray_client_server_port=25100, ray_worker_ports=[], ray_gcs_port=25097)
2024-07-10 15:26:43,898|alice|INFO|secretflow|entry.py:start_ray:63| Trying to start ray head node at job-split-1-partner-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=4 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=job-split-1-partner-0-global.alice.svc --port=25097 --node-manager-port=25098 --object-manager-port=25099 --ray-client-server-port=25100
2024-07-10 15:26:47,658|alice|INFO|secretflow|entry.py:start_ray:80| 2024-07-10 15:26:44,592 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-07-10 15:26:44,592 INFO scripts.py:744 -- Local node IP: job-split-1-partner-0-global.alice.svc
2024-07-10 15:26:47,518 SUCC scripts.py:781 -- --------------------
2024-07-10 15:26:47,518 SUCC scripts.py:782 -- Ray runtime started.
2024-07-10 15:26:47,518 SUCC scripts.py:783 -- --------------------
2024-07-10 15:26:47,518 INFO scripts.py:785 -- Next steps
2024-07-10 15:26:47,518 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-07-10 15:26:47,518 INFO scripts.py:791 -- ray start --address='job-split-1-partner-0-global.alice.svc:25097'
2024-07-10 15:26:47,518 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-07-10 15:26:47,518 INFO scripts.py:802 -- import ray
2024-07-10 15:26:47,519 INFO scripts.py:803 -- ray.init(_node_ip_address='job-split-1-partner-0-global.alice.svc')
2024-07-10 15:26:47,519 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-07-10 15:26:47,519 INFO scripts.py:835 -- ray stop
2024-07-10 15:26:47,519 INFO scripts.py:838 -- To view the status of the cluster, use
2024-07-10 15:26:47,519 INFO scripts.py:839 -- ray status

2024-07-10 15:26:47,658|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at job-split-1-partner-0-global.alice.svc.
2024-07-10 15:26:47,659|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param {
"domain": "data_prep",
"name": "train_test_split",
"version": "0.0.1",
"attrPaths": [
"train_size",
"test_size",
"random_state",
"shuffle"
],
"attrs": [
{
"f": 0.75
},
{
"f": 0.25
},
{
"i64": "1234"
},
{
"b": true
}
]
}
2024-07-10 15:26:47,667|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id psi-output-1 to
...........
name: "psi-output-1.csv"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "psi-output-1.csv"
party: "alice"
format: "csv"
}
data_refs {
uri: "psi-output-1.csv"
party: "bob"
format: "csv"
}

....
2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:159|

Secretflow 1.6.0b0
Build time (May 21 2024, 06:18:47) with commit id: ba76e1fe43cf3daa0c91423a660f318810c88030

2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:160|

param

domain: "data_prep"
name: "train_test_split"
version: "0.0.1"
attr_paths: "train_size"
attr_paths: "test_size"
attr_paths: "random_state"
attr_paths: "shuffle"
attrs {
f: 0.75
}
attrs {
f: 0.25
}
attrs {
i64: 1234
}
attrs {
b: true
}
inputs {
name: "psi-output-1.csv"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "psi-output-1.csv"
party: "alice"
format: "csv"
}
data_refs {
uri: "psi-output-1.csv"
party: "bob"
format: "csv"
}
}
output_uris: "train-dataset-1.csv"
output_uris: "test-dataset-1.csv"

--

2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:161|

storage_config

type: "local_fs"
local_fs {
wd: "/home/kuscia/var/storage/data"
}

--

2024-07-10 15:26:47,668|alice|WARNING|secretflow|entry.py:comp_eval:162|

cluster_config

desc {
parties: "alice"
parties: "bob"
devices {
name: "spu"
type: "spu"
parties: "alice"
parties: "bob"
config: "{"runtime_config":{"protocol":"REF2K","field":"FM64"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
}
devices {
name: "heu"
type: "heu"
parties: "alice"
parties: "bob"
config: "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
}
ray_fed_config {
cross_silo_comm_backend: "brpc_link"
}
}
public_config {
ray_fed_config {
parties: "alice"
parties: "bob"
addresses: "0.0.0.0:25102"
addresses: "job-split-1-partner-0-fed.bob.svc:80"
}
spu_configs {
name: "spu"
parties: "alice"
parties: "bob"
addresses: "0.0.0.0:25101"
addresses: "http://job-split-1-partner-0-spu.bob.svc:80"
}
}
private_config {
self_party: "alice"
ray_head_addr: "job-split-1-partner-0-global.alice.svc:25097"
}

--

2024-07-10 15:26:47,668|alice|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-07-10 15:26:47,669 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: job-split-1-partner-0-global.alice.svc:25097...
2024-07-10 15:26:47,675|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock
2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753376 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock
2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock
2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753376 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock
2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753424 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753424 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753424 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,680|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753424 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,680|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753328 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,680|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753328 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753328 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753328 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753520 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753520 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753520 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753520 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753376 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753376 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock
2024-07-10 15:26:47,682 INFO worker.py:1724 -- Connected to Ray cluster.
2024-07-10 15:26:48.575 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '0.0.0.0:25102', 'bob': 'http://job-split-1-partner-0-fed.bob.svc:80'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
(raylet) [2024-07-10 15:26:49,120 I 10119 10119] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=10119) 2024-07-10 15:26:49.883 INFO link.py:38 [alice] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=10119) I0710 15:26:49.891576 10119 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=25102.
(SenderReceiverProxyActor pid=10119) W0710 15:26:49.891611 10119 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-07-10 15:26:52.346 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-07-10 15:26:52.347 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
(SenderReceiverProxyActor pid=10119) I0710 15:26:52.410292 10214 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.152652.10119/id.db and ./rpc_data/rpcz/20240710.152652.10119/time.db
(_run pid=9847) WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.

@aokaokd

aokaokd commented Jul 15, 2024

Go into bob's container and check whether the file psi-output-1.csv was actually generated.
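The traceback on bob's side ends at `assert os.path.exists(full_remote_fn)` inside `get_file_meta`, so the check suggested here comes down to a path-existence test. A minimal sketch of that check, to be run inside bob's container, might look like the following. The working directory and file name are copied from the `storage_config` and `data_refs` entries in the logs; the assumption that the full path is simply the working directory joined with the URI, and the helper name `check_input_file`, are illustrative and not taken from the SecretFlow source.

```python
import os

# Values copied from the logs above (storage_config.local_fs.wd and data_refs.uri).
STORAGE_WD = "/home/kuscia/var/storage/data"
INPUT_URI = "psi-output-1.csv"


def check_input_file(wd: str, uri: str) -> dict:
    """Report whether wd/uri exists and, if it does, its size in bytes.

    Assumption: the path SecretFlow asserts on is wd joined with uri.
    """
    full_path = os.path.join(wd, uri)
    exists = os.path.exists(full_path)
    return {
        "path": full_path,
        "exists": exists,
        "size_bytes": os.path.getsize(full_path) if exists else None,
    }


if __name__ == "__main__":
    # Run this in both alice's and bob's containers and compare the results.
    print(check_input_file(STORAGE_WD, INPUT_URI))
```

If `exists` is false on bob but true on alice, the PSI task likely did not write its output where the split task on bob is looking, which would match the `AssertionError` above.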


Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.
