Job is killed when num_passes is larger than 2 #608

Open
christianahui opened this issue Feb 7, 2018 · 1 comment

@christianahui

When I train my model locally, everything seems to be fine. After I submit my job to paddlecloud, it is killed if num_passes is larger than 2 (num_passes is the parameter of the trainer.train function).

num_passes is 2: seems ok
num_passes is 3: job is killed after pass 2
num_passes is 4: job is killed after pass 3; the job is only ever killed after the last pass

Besides, the log also shows "Failed trainer count beyond the threadhold: 0". What does "trainer count" mean? Do I need to specify this parameter in paddle.init(), and if so, how?
Thank you so much!

@Yancey1989
Collaborator

num_passes is 4: job is killed after pass 3, job is only killed after the last pass

Usually this is caused by the job exceeding the memory limit specified by the -memory submit argument. Please try increasing it.
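To check whether memory really grows from pass to pass before raising the limit, one option is to log the process's peak memory at the end of every pass. The following is only a minimal sketch using the Python standard library; the helper names peak_rss_mb and make_memory_logger are hypothetical and are not part of Paddle or paddlecloud.

```python
from __future__ import print_function

import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / float(divisor)


def make_memory_logger(log=print):
    """Build a callback to invoke at the end of each pass (hypothetical helper)."""
    history = []

    def on_end_pass(pass_id):
        mb = peak_rss_mb()
        history.append((pass_id, mb))
        log("pass %d: peak RSS %.1f MiB" % (pass_id, mb))
        return mb

    # Expose the recorded (pass_id, MiB) pairs for later inspection.
    on_end_pass.history = history
    return on_end_pass


if __name__ == "__main__":
    on_end_pass = make_memory_logger()
    on_end_pass(0)
```

Assuming the event handler passed to trainer.train receives an end-of-pass event that carries the pass index, on_end_pass could be called from there. If the logged value climbs toward the -memory limit on each pass, that would be consistent with the job being killed for exceeding it.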

Failed trainer count beyond the threadhold: 0, what does the "trainer count"

The trainer count is the number of trainer nodes. This is a system log message: it means the training job fails when the number of failed trainer nodes exceeds the threshold (here, 0).

You also don't need to specify any parameters in paddle.init; just check why the trainer node failed.
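The threshold rule described above can be sketched as follows. This is an illustration only, not PaddleCloud's actual code: the function name job_has_failed is hypothetical, and it assumes that "beyond the threshold" means strictly greater than.

```python
from __future__ import print_function


def job_has_failed(failed_trainer_count, threshold=0):
    """Return True once more trainers have failed than the threshold allows.

    Assumption: "beyond the threshold" is a strict greater-than comparison.
    """
    return failed_trainer_count > threshold


if __name__ == "__main__":
    # With the threshold of 0 from the log, a single failed trainer
    # is enough to fail the whole job.
    for failed in (0, 1, 2):
        print(failed, job_has_failed(failed))
```

Under this reading, a threshold of 0 means the job tolerates no failed trainer nodes, so the log line reports a consequence of the failure rather than its cause.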

@typhoonzero typhoonzero added the bug label Feb 8, 2018