When I train my model locally, everything seems to be fine. After I submit my job to paddlecloud, it is killed if num_passes is larger than 2 (num_passes is the parameter of the trainer.train function):
num_passes is 2: seems ok
num_passes is 3: job is killed after pass 2
num_passes is 4: job is killed after pass 3. In each case the job is only killed after the last pass.
Besides, the log also shows: Failed trainer count beyond the threadhold: 0. What does the "trainer count" mean? Do I need to specify this parameter in paddle.init(), and how?
Thank you so much!
> num_passes is 4: job is killed after pass 3, job is only killed after the last pass

Usually this is caused by the job exceeding the memory threshold, which is specified by the -memory argument at submission. Please try increasing it.
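To check whether memory really grows from pass to pass before raising the limit, it can help to print the process's peak memory into the job log. The following is only a minimal sketch using the Python standard library; the helper names `peak_rss_mb` and `log_memory` are hypothetical, and how you call them at the end of each pass (for example from an event handler passed to trainer.train) is an assumption about your training script.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def log_memory(pass_id):
    """Print peak memory so growth between passes shows up in the job log."""
    message = "pass %d: peak RSS %.1f MB" % (pass_id, peak_rss_mb())
    print(message)
    return message
```

If the reported peak keeps climbing after every pass and approaches the value given to -memory, the memory limit is a likely cause of the kill.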
> Failed trainer count beyond the threadhold: 0, what does the "trainer count" mean?

The trainer count is the number of trainer nodes. This line is a system log message: it means that the training job fails when the number of failed trainer nodes exceeds the threshold (here, 0).
You also don't need to specify any parameters in paddle.init for this; just check why the trainer node failed.
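The rule behind that log message can be sketched as follows. This is only an illustrative model of the behaviour described above, not PaddleCloud's actual code; the function name `job_should_fail` is hypothetical.

```python
def job_should_fail(failed_trainers, threshold=0):
    """Return True when the number of failed trainer nodes exceeds the threshold."""
    # With threshold=0, a single failed trainer node is enough to fail the job.
    return failed_trainers > threshold
```

With a threshold of 0, one trainer node killed for any reason (such as running out of memory) fails the whole job, which is why finding the reason for that node's failure is the real fix.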