KKRT/RR22 protocols fail when deployed with Docker #366

Open
coderSun20201112 opened this issue Jul 5, 2024 · 11 comments

Comments

@coderSun20201112

Issue Type

Others

Search for existing issues similar to yours

Yes

Kuscia Version

kuscia 0.5.0

Link to Relevant Documentation

No response

Question Details

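I set up two nodes from the Docker image in P2P mode, one for the central bank (renhang) and one for a commercial bank (shanghang), to run PSI (private set intersection). In testing, the ECDH protocol works without problems, but KKRT and RR22 fail every time, so I am looking for a fix. The logs are below: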

RR22 log:
2024-07-05T10:54:04.02101077+08:00 stdout F [2024-07-05 10:54:04.020] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[43b39692bdf27cbe];[content-length]:[95];[kuscia-error-message]:[Domain shanghang.root-kuscia-autonomy-shanghang<--Domain renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[43b39692bdf27cbe];[x-envoy-upstream-service-time]:[74];[date]:[Fri, 05 Jul 2024 02:54:03 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection termination'
2024-07-05T10:54:05.023777727+08:00 stdout F [2024-07-05 10:54:05.023] [info] [channel.cc:352] send request failed and retry, retry_count=2, max_retry=3, interval_ms=3000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[d140d7eeeca679af];[content-length]:[145];[kuscia-error-message]:[Domain shanghang.root-kuscia-autonomy-shanghang<--Domain renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[d140d7eeeca679af];[x-envoy-upstream-service-time]:[1];[date]:[Fri, 05 Jul 2024 02:54:04 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111'
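
For anyone triaging retry lines like the two above, a small parser can pull out the fields that matter. This is only a sketch: `summarize_retry_line` and its regular expressions are hypothetical helpers, not part of yacl or Kuscia, and they assume the `[name]:[value];` header layout seen in these logs. One detail worth noting: the second line reports `delayed connect error: 111`, and errno 111 on Linux is ECONNREFUSED, which would be consistent with nothing listening on the peer's task port at that moment. The log alone does not say why.

```python
import re

# Hypothetical patterns, written against the log layout shown in this issue.
STATUS_RE = re.compile(r"http status code '(\d+)'")
RETRY_RE = re.compile(r"retry_count=(\d+), max_retry=(\d+)")
HEADER_RE = re.compile(r"\[([^\[\]]+)\]:\[([^\[\]]*)\]")

# Abbreviated copy of the first RR22 log line (some headers omitted).
SAMPLE = (
    "[2024-07-05 10:54:04.020] [info] [channel.cc:352] send request failed and retry, "
    "retry_count=1, max_retry=3, interval_ms=1000, "
    "message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] "
    "cntl ErrorCode '1010', http status code '503', response header "
    "'[x-b3-traceid]:[43b39692bdf27cbe];[content-length]:[95];"
    "[kuscia-error-message]:[Domain shanghang.root-kuscia-autonomy-shanghang<--"
    "Domain renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];"
    "[server]:[envoy];', response body ''"
)


def summarize_retry_line(line: str) -> dict:
    """Extract HTTP status, retry counters and the kuscia error header."""
    status = STATUS_RE.search(line)
    retry = RETRY_RE.search(line)
    # Only "[name]:[value]" pairs match, so "[info]" and similar are skipped.
    headers = dict(HEADER_RE.findall(line))
    return {
        "http_status": int(status.group(1)) if status else None,
        "retry": (int(retry.group(1)), int(retry.group(2))) if retry else None,
        "kuscia_error": headers.get("kuscia-error-message"),
        "server": headers.get("server"),
    }


if __name__ == "__main__":
    print(summarize_retry_line(SAMPLE))
```

The `kuscia-error-message` header records the hop chain (shanghang, then renhang, then 10.2.11.62), so it is usually the first field to read when deciding which side to inspect.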

KKRT log:
2024-07-05T09:50:32.168913916+08:00 stdout F [2024-07-05 09:50:32.168] [info] [csv_checker.cc:241] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked_duplicates
2024-07-05T09:50:37.792264335+08:00 stdout F [2024-07-05 09:50:37.792] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked
2024-07-05T09:50:39.812578523+08:00 stdout F [2024-07-05 09:50:39.812] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
2024-07-05T09:50:39.819929341+08:00 stdout F [2024-07-05 09:50:39.819] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2024-07-05T09:50:39.820391329+08:00 stdout F [2024-07-05 09:50:39.820] [info] [receiver.cc:42] [KkrtPsiReceiver::Init] end
2024-07-05T09:50:39.820403246+08:00 stdout F [2024-07-05 09:50:39.820] [info] [receiver.cc:47] [KkrtPsiReceiver::PreProcess] start
2024-07-05T09:50:39.820478147+08:00 stdout F [2024-07-05 09:50:39.820] [info] [bucket_psi.cc:515] psi protocol=2, rank=0 item_size=10000
2024-07-05T09:50:39.82048501+08:00 stdout F [2024-07-05 09:50:39.820] [info] [bucket_psi.cc:515] psi protocol=2, rank=1 item_size=10000000
2024-07-05T09:50:51.568753187+08:00 stdout F [2024-07-05 09:50:51.568] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/alice_psi.csv.
2024-07-05T09:50:51.569585197+08:00 stdout F [2024-07-05 09:50:51.569] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/alice_psi.csv.
@aokaokd

aokaokd commented Jul 5, 2024

Does your data contain duplicate records?

@coderSun20201112
Author

> Does your data contain duplicate records?

OK, I will check the data and ask the business team.

@coderSun20201112
Author

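> Does your data contain duplicate records?

I generated 1,000 new test records, 560 of which are in the intersection. The ID card numbers of those 560 records are all distinct, and the ID card number is the column I use as the intersection key. Even so, the task still fails.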
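
To rule duplicates out on the full input file rather than only on the intersecting rows, the key column can be checked directly. The sketch below is an illustration, not the PSI library's own check (the logs show that it runs `sort | uniq -d` on a pre-processed file). `find_duplicate_keys` is a hypothetical helper, and the column name `证件号码` is taken from the `keys` field in the task config posted later in this thread.

```python
import csv
from collections import Counter


def find_duplicate_keys(lines, key_column: str) -> dict:
    """Return {value: count} for key values that occur more than once.

    `lines` is any iterable of CSV text lines, for example an open file.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or key_column not in reader.fieldnames:
        raise KeyError(f"column {key_column!r} not found in CSV header")
    counts = Counter(row[key_column] for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


# Example usage (the path is a placeholder for the real input file):
# with open("input.csv", newline="", encoding="utf-8") as f:
#     print(find_duplicate_keys(f, "证件号码"))
```

An empty result means every key value is unique. If the file starts with a byte-order mark, opening it with `encoding="utf-8-sig"` avoids a spurious "column not found" error.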

@aokaokd

aokaokd commented Jul 8, 2024

OK. Is the failure log the same as the one above?

@coderSun20201112
Author

> OK. Is the failure log the same as the one above?

Yes, it is the same.

@zimu-yuxi

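> > OK. Is the failure log the same as the one above?
>
> Yes, it is the same.

Do you have more of the task logs? Inside the kuscia container, the logs for the failed task ID are under /home/kuscia/var/stdout/.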
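
One way to locate those files is a recursive search for the task ID under that directory. This is a minimal sketch, assuming only that the task ID appears somewhere in the path of each log file; the exact directory layout below /home/kuscia/var/stdout/ is not stated in this thread, and `find_task_logs` is a hypothetical helper.

```python
from pathlib import Path


def find_task_logs(task_id: str, root: str = "/home/kuscia/var/stdout") -> list:
    """Return all files under `root` whose relative path contains `task_id`."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        p
        for p in base.rglob("*")
        if p.is_file() and task_id in str(p.relative_to(base))
    )


# Example usage inside the kuscia container (replace the placeholder ID):
# for path in find_task_logs("<task-id>"):
#     print(path)
```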

@coderSun20201112
Author

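> > > OK. Is the failure log the same as the one above?
> >
> > Yes, it is the same.
>
> Do you have more of the task logs? Inside the kuscia container, the logs for the failed task ID are under /home/kuscia/var/stdout/.

I ran another test with RR22. The logs are below:

Pod log: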
2024-07-10T18:25:45.296799503+08:00 stdout F [2024-07-10 18:25:45.281] [info] [main.cc:44] SecretFlow PSI Library v0.2.0.dev240123 Copyright 2023 Ant Group Co., Ltd.
2024-07-10T18:25:45.299321156+08:00 stdout F [2024-07-10 18:25:45.299] [info] [main.cc:56] Kuscia task id: yqxxeraj
2024-07-10T18:25:45.317512483+08:00 stderr F I0710 18:25:45.317143 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=54509.
2024-07-10T18:25:45.317571852+08:00 stderr F W0710 18:25:45.317178 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-07-10T18:25:48.547728713+08:00 stderr F I0710 18:25:48.547527 26 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.182548.7/id.db and ./rpc_data/rpcz/20240710.182548.7/time.db
2024-07-10T18:25:51.363771015+08:00 stderr F [978.334] perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
2024-07-10T18:25:51.364221936+08:00 stdout F [2024-07-10 18:25:51.364] [info] [launch.cc:115] PSI config: {"protocol_config":{"protocol":"PROTOCOL_RR22","role":"ROLE_SENDER","ecdh_config":{"curve":"CURVE_FOURQ"},"kkrt_config":{"bucket_size":"1048576"},"rr22_config":{"bucket_size":"1048576"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/learn_440_1980-01-01.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/result/yqxxeraj/"},"keys":["证件号码"],"recovery_config":{"enabled":true,"folder":"/home/kuscia/var/storage/data/tmp/yqxxeraj/"},"left_side":"ROLE_RECEIVER"}
2024-07-10T18:25:51.364241907+08:00 stdout F [2024-07-10 18:25:51.364] [info] [sender.cc:35] [Rr22PsiSender::Init] start
2024-07-10T18:25:51.364248729+08:00 stdout F [2024-07-10 18:25:51.364] [info] [interface.cc:76] [AbstractPsiParty::Init] start
2024-07-10T18:25:51.364255072+08:00 stdout F [2024-07-10 18:25:51.364] [warning] [interface.cc:300] check_hash_digest turns off while recovery is enabled. check_hash_digest is modified to true for robustness.
2024-07-10T18:25:51.371614123+08:00 stdout F [2024-07-10 18:25:51.371] [info] [interface.cc:134] [AbstractPsiParty::Init][Check csv pre-process] start
2024-07-10T18:25:51.379577942+08:00 stdout F [2024-07-10 18:25:51.379] [info] [csv_checker.cc:241] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked_duplicates
2024-07-10T18:25:51.414585957+08:00 stdout F [2024-07-10 18:25:51.414] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked
2024-07-10T18:25:51.428806196+08:00 stdout F [2024-07-10 18:25:51.428] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
2024-07-10T18:25:51.433757927+08:00 stdout F [2024-07-10 18:25:51.433] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2024-07-10T18:25:51.434165661+08:00 stdout F [2024-07-10 18:25:51.433] [info] [sender.cc:40] [Rr22PsiSender::Init] end
2024-07-10T18:25:51.434179781+08:00 stdout F [2024-07-10 18:25:51.434] [info] [sender.cc:45] [Rr22PsiSender::PreProcess] start
2024-07-10T18:25:51.434198772+08:00 stdout F [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=0 item_size=1000
2024-07-10T18:25:51.434205583+08:00 stdout F [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=1 item_size=1000
2024-07-10T18:25:51.436311717+08:00 stdout F [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
2024-07-10T18:25:51.436942769+08:00 stdout F [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
2024-07-10T18:25:51.439140404+08:00 stdout F [2024-07-10 18:25:51.439] [info] [sender.cc:79] [Rr22PsiSender::PreProcess] end
2024-07-10T18:25:51.441284522+08:00 stdout F [2024-07-10 18:25:51.441] [info] [sender.cc:84] [Rr22PsiSender::Online] start
2024-07-10T18:25:51.442326142+08:00 stdout F [2024-07-10 18:25:51.441] [info] [recovery.cc:188] RecoveryManager::MarkOnlineStart ecdh_dual_masked_cnt_from_peer_ = 0
2024-07-10T18:25:51.442357509+08:00 stdout F [2024-07-10 18:25:51.441] [info] [recovery.cc:192] RecoveryManager::MarkOnlineStart parsed_bucket_count_from_peer_ = 0
2024-07-10T18:25:51.446471601+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=0, inputs_size=1000
2024-07-10T18:25:51.446489556+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=1000
2024-07-10T18:25:51.44651467+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=1000
2024-07-10T18:25:51.448829406+08:00 stdout F [2024-07-10 18:25:51.448] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
2024-07-10T18:25:51.45112501+08:00 stdout F [2024-07-10 18:25:51.450] [info] [rr22_oprf.cc:139] recv paxos seed...
2024-07-10T18:25:51.456876126+08:00 stdout F [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:145] recv paxos seed finished
2024-07-10T18:25:51.4569054+08:00 stdout F [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:176] begin vole send

Pod configuration:
[root@root-kuscia-autonomy-renhang kuscia]# kubectl describe pods yqxxeraj-0 --namespace=renhang
Name: yqxxeraj-0
Namespace: renhang
Priority: 0
Service Account: default
Node: root-kuscia-autonomy-renhang/172.18.0.3
Start Time: Wed, 10 Jul 2024 18:25:42 +0800
Labels: kuscia.secretflow/communication-role-client=true
kuscia.secretflow/communication-role-server=true
kuscia.secretflow/controller=kusciatask
kuscia.secretflow/initiator=renhang
kuscia.secretflow/interconn-protocol-type=kuscia
kuscia.secretflow/task-id=yqxxeraj
kuscia.secretflow/task-resource=yqxxeraj-d74ad9f504a9
kuscia.secretflow/task-resource-group=yqxxeraj
task.kuscia.secretflow/pod-name=yqxxeraj-0
task.kuscia.secretflow/pod-role=
Annotations: kuscia.secretflow/config-template-volumes: config-template
kuscia.secretflow/image-id: sha256:ae331537eb75b273358b63a7b67d7aa80c190888cb38064360db5e60b6540b15
kuscia.secretflow/taskresource-reserving-timestamp: 2024-07-10T18:25:42+08:00
Status: Failed
IP:
IPs:
Controlled By: KusciaTask/yqxxeraj
Containers:
secretflow:
Container ID: containerd://73e2004ceeed41086797dcb848597dddd9c1713c5d71724de521330def4593c7
Image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/psi-anolis8:0.2.0.dev240123
Image ID: sha256:ae331537eb75b273358b63a7b67d7aa80c190888cb38064360db5e60b6540b15
Port: 54509/TCP
Host Port: 0/TCP
Command:
sh
Args:
-c
/root/main --kuscia /etc/kuscia/task-config.conf
State: Terminated
Reason: Error
Message: 1G --stable | LC_ALL=C uniq -d > /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked_duplicates
[2024-07-10 18:25:51.414] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked
[2024-07-10 18:25:51.428] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
[2024-07-10 18:25:51.433] [info] [interface.cc:183] [AbstractPsiParty::Init] end
[2024-07-10 18:25:51.433] [info] [sender.cc:40] [Rr22PsiSender::Init] end
[2024-07-10 18:25:51.434] [info] [sender.cc:45] [Rr22PsiSender::PreProcess] start
[2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=0 item_size=1000
[2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=1 item_size=1000
[2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
[2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
[2024-07-10 18:25:51.439] [info] [sender.cc:79] [Rr22PsiSender::PreProcess] end
[2024-07-10 18:25:51.441] [info] [sender.cc:84] [Rr22PsiSender::Online] start
[2024-07-10 18:25:51.441] [info] [recovery.cc:188] RecoveryManager::MarkOnlineStart ecdh_dual_masked_cnt_from_peer_ = 0
[2024-07-10 18:25:51.441] [info] [recovery.cc:192] RecoveryManager::MarkOnlineStart parsed_bucket_count_from_peer_ = 0
[2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=0, inputs_size=1000
[2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=1000
[2024-07-10 18:25:51.446] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=1000
[2024-07-10 18:25:51.448] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
[2024-07-10 18:25:51.450] [info] [rr22_oprf.cc:139] recv paxos seed...
[2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:145] recv paxos seed finished
[2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:176] begin vole send

  Exit Code:    132
  Started:      Wed, 10 Jul 2024 18:25:45 +0800
  Finished:     Wed, 10 Jul 2024 18:25:51 +0800
Ready:          False
Restart Count:  0
Environment:
  TASK_ID:              yqxxeraj
  TASK_CLUSTER_DEFINE:  {"parties":[{"name":"shanghang", "role":"", "services":[{"portName":"psi", "endpoints":["yqxxeraj-0-psi.shanghang.svc"]}]}, {"name":"renhang", "role":"", "services":[{"portName":"psi", "endpoints":["yqxxeraj-0-psi.renhang.svc"]}]}], "selfPartyIdx":1, "selfEndpointIdx":0}
  ALLOCATED_PORTS:      {"ports":[{"name":"psi", "port":54509, "scope":"Cluster", "protocol":"HTTP"}]}
  TASK_INPUT_CONFIG:    {
                          "sf_psi_config_map": {
                            "shanghang": {
                              "link_config": {
                                "recv_timeout_ms": "30000",
                                "http_timeout_ms": 30000
                              },
                              "psi_config": {
                                "protocol_config": {
                                  "protocol": "PROTOCOL_RR22",
                                  "role": "ROLE_RECEIVER",
                                  "ecdh_config": {
                                    "curve": "CURVE_FOURQ"
                                  },
                                  "kkrt_config": {
                                    "bucket_size": "1048576"
                                  },
                                  "rr22_config": {
                                    "bucket_size": "1048576"
                                  }
                                },
                                "input_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/learn_440_1970-01-01.csv"
                                },
                                "output_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/result/yqxxeraj/result2.csv"
                                },
                                "keys": ["证件号码"],
                                "recovery_config": {
                                  "enabled": true,
                                  "folder": "/home/kuscia/var/storage/data/tmp/yqxxeraj/"
                                },
                                "left_side": "ROLE_RECEIVER"
                              }
                            },
                            "renhang": {
                              "link_config": {
                                "recv_timeout_ms": "30000",
                                "http_timeout_ms": 30000
                              },
                              "psi_config": {
                                "protocol_config": {
                                  "protocol": "PROTOCOL_RR22",
                                  "role": "ROLE_SENDER",
                                  "ecdh_config": {
                                    "curve": "CURVE_FOURQ"
                                  },
                                  "kkrt_config": {
                                    "bucket_size": "1048576"
                                  },
                                  "rr22_config": {
                                    "bucket_size": "1048576"
                                  }
                                },
                                "input_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/learn_440_1980-01-01.csv"
                                },
                                "output_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/result/yqxxeraj/"
                                },
                                "keys": ["证件号码"],
                                "recovery_config": {
                                  "enabled": true,
                                  "folder": "/home/kuscia/var/storage/data/tmp/yqxxeraj/"
                                },
                                "left_side": "ROLE_RECEIVER"
                              }
                            }
                          }
                        }
Mounts:
  /etc/kuscia/task-config.conf from config-template (rw,path="task-config.conf")

Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-template:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: yqxxeraj-configtemplate
Optional: false
QoS Class: BestEffort
Node-Selectors: kuscia.secretflow/namespace=renhang
Tolerations: kuscia.secretflow/agent:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message


Warning FailedScheduling 3m16s kuscia-scheduler 0/1 nodes are available: failed to get task resource renhang/ for pod. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod., can not find related task resource.
Normal Scheduled 3m14s kuscia-scheduler Successfully assigned renhang/yqxxeraj-0 to root-kuscia-autonomy-renhang
Normal Pulled 3m14s Agent Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/psi-anolis8:0.2.0.dev240123" already present on machine
Normal Created 3m13s Agent Created container secretflow
Normal Started 3m12s Agent Started container secretflow
Warning MissingClusterDNS 3m11s (x4 over 3m15s) Agent pod: "yqxxeraj-0_renhang(0d848060-9ff7-42f3-9470-ce5c66fc3454)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
[root@root-kuscia-autonomy-renhang kuscia]#
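
The `Exit Code: 132` in the output above can be decoded. By the usual shell and container convention, an exit status above 128 means the process was terminated by signal number `status - 128`, and 132 - 128 = 4, which is SIGILL (illegal instruction) on Linux. The sketch below assumes the container runtime follows that convention; `describe_exit_code` is a hypothetical helper.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(132))
```

A SIGILL right after the last log line, `begin vole send`, would be consistent with the binary executing a CPU instruction that the host does not support, but the exit code by itself does not identify which instruction was involved.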

@coderSun20201112
Author

In earlier issues I saw people mention AVX and AVX2. Are there CPU requirements?

@wenkesong-li

Hi, AVX and AVX2 require a CPU that supports the AVX instruction set.

@coderSun20201112
Author

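> Hi, AVX and AVX2 require a CPU that supports the AVX instruction set.

So can the logs show whether my KKRT/RR22 runs fail because our server does not support AVX/AVX2? And if AVX/AVX2 is not the cause, how should I resolve this?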
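
Whether the host CPU advertises AVX and AVX2 can be checked on the server itself. The sketch below reads the `flags` line of /proc/cpuinfo, which exists on x86 Linux and shows the host's CPU even from inside a container. The helper names are hypothetical, and the required set `("avx", "avx2")` comes from this thread, not from a documented requirement list for the PSI image.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str, required=("avx", "avx2")) -> list:
    """List the required flags that the CPU does not advertise."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            text = f.read()
    except OSError as exc:
        print(f"cannot read /proc/cpuinfo: {exc}")
    else:
        missing = missing_flags(text)
        print("missing flags:", missing if missing else "none")
```

If either flag is reported missing, that would fit the pod's exit code 132 shown earlier in this thread. If both are present, the cause is more likely elsewhere and the complete task logs from both parties are the next thing to compare.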

