SCQL run command: the --timeout setting has no effect #365

Closed
friendsAI opened this issue Sep 26, 2024 · 18 comments

Comments

@friendsAI

Issue Type

Running

Have you searched for existing issues?

Yes

OS Platform and Distribution

linux v10

SCQL Version

0.8.1b

What happened and what you expected to happen.

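Running the run command with --timeout has no effect. With the timeout set to 500 seconds, a timeout error is reported very quickly and the task does not complete. Does --timeout have a built-in upper limit? Why isn't the value the user sets actually applied?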

Configuration used to run SCQL.

A timeout error is reported.

SCQL log output.

A timeout error is reported.
@BrainWH

BrainWH commented Sep 27, 2024

Hi, could you post the command you ran and the related logs?

@friendsAI
Author

Command:
./brokerctl run "select cloud2.ID,cloud2.name,cloud2.cardno from cloud2 inner join ga on ga.ID=cloud2.ID;" --project-id "9ff175c9ca1f49bca5f363a88f4ffbd7d" --host "http://192.168.90.171:8080" --timeout 500
Part of the broker log:
2024-09-27 02:54:49.9272 ERROR executor.go:98 |RequestID:|SessionID:bad8437a-7c7b-11ef-a9b6-0242ac170002|ActionName:EngineStub@RunExecutionPlan|CostTime:54.583204491s|Reason:InvalidResponse|ErrorMsg:Error: code=320, msg="RunExecutionPlan create session(bad8437a-7c7b-11ef-a9b6-0242ac170002) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=bad8437a-7c7b-11ef-a9b6-0242ac170002:2:ALLGATHER "|Request:
2024-09-27 02:54:49.9272 ERROR common.go:139 |RequestID:|RequestParty:|SessionID:|ActionName:Intra@DoQuery|CostTime:56.623403125s|Reason:|ErrorMsg:runQuery Execute err: Error: code=320, msg="RunExecutionPlan create session(bad8437a-7c7b-11ef-a9b6-0242ac170002) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=bad8437a-7c7b-11ef-a9b6-0242ac170002:2:ALLGATHER "|Request:project_id:"9ff175c9ca1f49bca5f363a88f4ffbd7d" query:"select cloud2.ID,cloud2.name,cloud2.cardno from cloud2 inner join ga on ga.ID=cloud2.ID;" debug_opts:{} job_config:{}
2024-09-27 02:54:49.9272 INFO server.go:135 |GIN|status=200|method=POST|path=/intra/query|ip=192.168.90.171|latency=56.623593684s|
Part of the engine log:
2024-09-27 02:53:07.267 [info] [engine_service_impl.cc:RunPlanSync:571] [job(6fd502f0-7c7b-11ef-a9b6-0242ac170002)] RunExecutionPlan success, sessionID=6fd502f0-7c7b-11ef-a9b6-0242ac170002
2024-09-27 02:53:15.927 [info] [session_manager.cc:RemoveSession:226] [scqlengine] session(6fd502f0-7c7b-11ef-a9b6-0242ac170002) removed, running_cost(86796ms), current running session=0
2024-09-27 02:53:15.928 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(6fd502f0-7c7b-11ef-a9b6-0242ac170002) not exists. default return nullptr.
2024-09-27 02:53:15.928 [warning] [session.cc:ActiveLogger:336] [scqlengine] can not get valid session
2024-09-27 02:53:15.928 [info] [engine_service_impl.cc:StopJob:173] [scqlengine] EngineServiceImpl::StopJob(6fd502f0-7c7b-11ef-a9b6-0242ac170002), reason()
2024-09-27 02:53:15.928 [warning] [session_manager.cc:StopSession:174] [scqlengine] session(6fd502f0-7c7b-11ef-a9b6-0242ac170002) not exists.
2024-09-27 02:53:54.428 [warning] [listener.cc:GetListener:62] [scqlengine] Listener for link_id:bad8437a-7c7b-11ef-a9b6-0242ac170002 not exist.
2024-09-27 02:54:47.525 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(bad8437a-7c7b-11ef-a9b6-0242ac170002) not exists. default return nullptr.
2024-09-27 02:54:47.539 [error] [engine_service_impl.cc:RunExecutionPlan:301] [scqlengine] RunExecutionPlan create session(bad8437a-7c7b-11ef-a9b6-0242ac170002) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=bad8437a-7c7b-11ef-a9b6-0242ac170002:2:ALLGATHER
2024-09-27 02:54:47.540 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(bad8437a-7c7b-11ef-a9b6-0242ac170002) not exists. default return nullptr.
2024-09-27 02:54:47.540 [warning] [session.cc:ActiveLogger:336] [scqlengine] can not get valid session
2024-09-27 02:54:49.553 [warning] [engine_service_impl.cc:ReportErrorToPeers:661] [scqlengine] sync error to peer=(1200001725521632516,192.168.90.171:5006) failed: status: code: 141
message: "no session for job_id=bad8437a-7c7b-11ef-a9b6-0242ac170002"

2024-09-27 02:54:49.555 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(bad8437a-7c7b-11ef-a9b6-0242ac170002) not exists. default return nullptr.
2024-09-27 02:54:49.555 [warning] [session.cc:ActiveLogger:336] [scqlengine] can not get valid session
2024-09-27 02:54:49.555 [info] [engine_service_impl.cc:StopJob:173] [scqlengine] EngineServiceImpl::StopJob(bad8437a-7c7b-11ef-a9b6-0242ac170002), reason()
2024-09-27 02:54:49.555 [warning] [session_manager.cc:StopSession:174] [scqlengine] session(bad8437a-7c7b-11ef-a9b6-0242ac170002) not exists.

@lanyy9527

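How large is your data? Could you share how much memory the two machines have? Also, is the bandwidth or latency between the two machines limited in any way?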

@friendsAI
Author

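The data on both sides is very small, fewer than 100 rows each. Each machine has 64 GB of memory. Does SCQL have parameters for bandwidth and latency? I haven't limited them anywhere else.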

@lanyy9527

Then it may be a connection problem between MySQL and the engine. Could you post your gflags.conf and docker-compose configuration?

@friendsAI
Author

friendsAI commented Sep 27, 2024

gflags.conf:
--listen_port=8003
--datasource_router=embed
--enable_driver_authorization=false
--server_enable_ssl=false
--driver_enable_ssl_as_client=false
--peer_engine_enable_ssl_as_client=false
--embed_router_conf={"datasources": [{"id": "ds001", "name": "mysql db", "kind": "MYSQL", "connection_str": "db=bob;user=root;password=123456;host=192.168.90.171;auto-reconnect=true"}], "rules": [{"db": "", "table": "", "datasource_id": "ds001"}]}
--enable_self_auth=false
--enable_peer_auth=false
--peer_engine_protocol=http:proto
--peer_engine_connection_type=pooled
--spu_allowed_protocols=CHEETAH

docker-compose.yaml:
version: '3.8'
services:
  broker:
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/scql:latest
    command:
      - /home/admin/bin/broker
      - -config=/home/admin/configs/config.yml
    restart: always
    ports:
      - 8080:8080
      - 8081:8081
    volumes:
      - ./config.yml:/home/admin/configs/config.yml
      - ./party_info.json:/home/admin/configs/party_info.json
      - ./ed25519key.pem:/home/admin/configs/ed25519key.pem
    security_opt:
      - seccomp:unconfined
  engine:
    cap_add:
      - NET_ADMIN
    command:
      - /home/admin/bin/scqlengine
      - --flagfile=/home/admin/engine/conf/gflags.conf
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/scql:latest
    ports:
      - 8003:8003
    volumes:
      - ./gflags.conf:/home/admin/engine/conf/gflags.conf
    security_opt:
      - seccomp:unconfined

@lanyy9527

Try adding the MySQL port to connection_str, then restart the services and watch the engine log with docker logs engine-name.
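As a sketch of the suggested change, the hypothetical Go helper below adds a port entry to a semicolon-separated connection_str when none is present. It assumes the engine accepts a port key in connection_str (as the suggestion above implies) and uses 3306, MySQL's default port, as a placeholder; substitute the port your MySQL server actually listens on.

```go
package main

import (
	"fmt"
	"strings"
)

// withPort returns connStr with a "port=<port>" entry appended, unless the
// semicolon-separated key=value string already contains a port key.
// Hypothetical helper for illustration; not part of SCQL.
func withPort(connStr string, port int) string {
	for _, kv := range strings.Split(connStr, ";") {
		key, _, _ := strings.Cut(kv, "=")
		if strings.TrimSpace(key) == "port" {
			return connStr // port already specified; leave it unchanged
		}
	}
	return fmt.Sprintf("%s;port=%d", strings.TrimSuffix(connStr, ";"), port)
}

func main() {
	// connection_str from the gflags.conf above; 3306 is an assumed port.
	orig := "db=bob;user=root;password=123456;host=192.168.90.171;auto-reconnect=true"
	fmt.Println(withPort(orig, 3306))
}
```

Only the connection_str value inside --embed_router_conf changes; the rest of gflags.conf stays as it is.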

@friendsAI
Author

Engine log:
2024-09-27 10:43:16.686 [info] [main.cc:BuildRouter:108] [scqlengine] Building EmbedRouter from json conf
2024-09-27 10:43:16.687 [info] [session_manager.cc:WatchSessionTimeoutThread:250] [scqlengine] WatchSessionTimeoutThread startup, session default timeout=1800s
2024-09-27 10:43:16.687 [info] [thread_pool.cc:ThreadPool:30] [scqlengine] Create a fixed thread pool with size 32
2024-09-27 10:43:16.688 [info] [main.cc:main:330] [scqlengine] Adding EngineService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:339] [scqlengine] Adding ErrorCollectorService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:348] [scqlengine] Adding MetricsService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:358] [scqlengine] Adding MuxReceiverService into main server...
2024-09-27 10:43:16.692 [warning] [server.cpp:BRPC:1187] [scqlengine] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-09-27 10:43:16.693 [info] [main.cc:main:378] [scqlengine] Started engine rpc server success, listen on: 0.0.0.0:8003

@friendsAI
Author

I ran the same run command again, and it still fails with an error well before 500 seconds have passed.
Error: run query: DoQuery response: {
"status": {
"code": 320,
"message": "RunExecutionPlan create session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=b0abbc6f-7cbd-11ef-ad09-0242ac180003:2:ALLGATHER "
}
}

Engine log:
2024-09-27 10:43:16.686 [info] [main.cc:BuildRouter:108] [scqlengine] Building EmbedRouter from json conf
2024-09-27 10:43:16.687 [info] [session_manager.cc:WatchSessionTimeoutThread:250] [scqlengine] WatchSessionTimeoutThread startup, session default timeout=1800s
2024-09-27 10:43:16.687 [info] [thread_pool.cc:ThreadPool:30] [scqlengine] Create a fixed thread pool with size 32
2024-09-27 10:43:16.688 [info] [main.cc:main:330] [scqlengine] Adding EngineService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:339] [scqlengine] Adding ErrorCollectorService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:348] [scqlengine] Adding MetricsService into brpc server
2024-09-27 10:43:16.689 [info] [main.cc:main:358] [scqlengine] Adding MuxReceiverService into main server...
2024-09-27 10:43:16.692 [warning] [server.cpp:BRPC:1187] [scqlengine] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-09-27 10:43:16.693 [info] [main.cc:main:378] [scqlengine] Started engine rpc server success, listen on: 0.0.0.0:8003
2024-09-27 10:46:03.647 [warning] [listener.cc:GetListener:62] [scqlengine] Listener for link_id:b0abbc6f-7cbd-11ef-ad09-0242ac180003 not exist.
2024-09-27 10:46:58.754 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) not exists. default return nullptr.
2024-09-27 10:46:58.757 [error] [engine_service_impl.cc:RunExecutionPlan:301] [scqlengine] RunExecutionPlan create session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=b0abbc6f-7cbd-11ef-ad09-0242ac180003:2:ALLGATHER
2024-09-27 10:46:58.758 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) not exists. default return nullptr.
2024-09-27 10:46:58.758 [warning] [session.cc:ActiveLogger:336] [scqlengine] can not get valid session
2024-09-27 10:47:00.771 [warning] [engine_service_impl.cc:ReportErrorToPeers:661] [scqlengine] sync error to peer=(1200001725521632516,192.168.90.171:5006) failed: status: code: 141
message: "no session for job_id=b0abbc6f-7cbd-11ef-ad09-0242ac180003"
2024-09-27 10:47:00.773 [warning] [session_manager.cc:GetSession:156] [scqlengine] session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) not exists. default return nullptr.
2024-09-27 10:47:00.773 [warning] [session.cc:ActiveLogger:336] [scqlengine] can not get valid session
2024-09-27 10:47:00.773 [info] [engine_service_impl.cc:StopJob:173] [scqlengine] EngineServiceImpl::StopJob(b0abbc6f-7cbd-11ef-ad09-0242ac180003), reason()
2024-09-27 10:47:00.773 [warning] [session_manager.cc:StopSession:174] [scqlengine] session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) not exists.
Broker log:
2024-09-27 10:47:00.92710 ERROR executor.go:98 |RequestID:|SessionID:b0abbc6f-7cbd-11ef-ad09-0242ac180003|ActionName:EngineStub@RunExecutionPlan|CostTime:56.08233881s|Reason:InvalidResponse|ErrorMsg:Error: code=320, msg="RunExecutionPlan create session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) failed, catch std::exception=[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=b0abbc6f-7cbd-11ef-ad09-0242ac180003:2:ALLGATHER "|Request:
2024-09-27 10:47:00.92710 ERROR common.go:139 |RequestID:|RequestParty:|SessionID:|ActionName:Intra@DoQuery|CostTime:58.126842499s|Reason:|ErrorMsg:runQuery Execute err: Error: code=320, msg="RunExecutionPlan create session(b0abbc6f-7cbd-11ef-ad09-0242ac180003) failed, catch std::exception=[external/yacl/yacl/link/transport/ debug_opts:{} job_config:{}

@lanyy9527

Try increasing link_recv_timeout_ms.
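For reference, a minimal Go sketch of that change: it sets --link_recv_timeout_ms in the contents of a gflags flagfile, replacing an existing line or appending a new one. The flag name comes from the reply above and its unit is milliseconds (the 30000 ms default is confirmed later in this thread); 600000 (10 minutes) is only an illustrative value, and setFlag is a hypothetical helper, not part of SCQL. gflags reads the flagfile at startup, so the engine container must be restarted for a new value to apply.

```go
package main

import (
	"fmt"
	"strings"
)

// setFlag returns the flagfile contents with "--name=value" set: an existing
// line for the flag is replaced, otherwise the flag is appended at the end.
func setFlag(conf, name, value string) string {
	prefix := "--" + name + "="
	line := prefix + value
	lines := strings.Split(strings.TrimRight(conf, "\n"), "\n")
	for i, l := range lines {
		if strings.HasPrefix(strings.TrimSpace(l), prefix) {
			lines[i] = line
			return strings.Join(lines, "\n") + "\n"
		}
	}
	return strings.Join(append(lines, line), "\n") + "\n"
}

func main() {
	// Two lines taken from the gflags.conf posted above.
	conf := "--listen_port=8003\n--peer_engine_protocol=http:proto\n"
	// 600000 ms is an illustrative value, not a recommendation.
	fmt.Print(setFlag(conf, "link_recv_timeout_ms", "600000"))
}
```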

@friendsAI
Author

OK, I'll try that. The official documentation says its default value is 30000; is the unit milliseconds?

@lanyy9527

OK, I'll try that. The official documentation says its default value is 30000; is the unit milliseconds?

Yes, the default is 30 s.

@friendsAI
Author

friendsAI commented Sep 29, 2024

[Screenshot]

When I execute the run statement, why does the error message show a different SQL statement? Is this a SQL statement you preset?

@lanyy9527

Check the command you used to create the table and see whether it specifies this table.

@friendsAI
Author

[Screenshot]
No. This table doesn't exist anywhere in this project.

@lanyy9527

Please post the create table command you ran for bob.

@friendsAI
Author

Oh, I checked and it was my own mistake. Thanks. As for the timeout setting, I'll reply once I've tested whether setting link_recv_timeout_ms takes effect. Thanks!


Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 6, 2024