Torque nodes left in down after error condition disappeared #13

Open
vholer opened this issue Jan 4, 2018 · 0 comments
vholer commented Jan 4, 2018

A node ran out of disk space. Torque switched it to the down state with an explanatory note. After the error condition cleared, the node didn't return to up/free automatically; pbs_mom on the affected node had to be restarted manually to repair the state.

$ pbsnodes
node1.localdomain
     state = down
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=...,sessions=10062 10109 10274,nsessions=3,nusers=1,idletime=1444966,totmem=8010380kb,availmem=7536480kb,physmem=8010380kb,ncpus=8,loadave=0.00,message=ERROR: torque spool filesystem full,gres=,netload=15776691386,state=free,varattr= ,cpuclock=Fixed,version=6.1.1.1,rectime=1515096744,jobs=
     note = ERROR: torque spool filesystem full
     mom_service_port = 15002
     mom_manager_port = 15003

Consider a proactive operation that fixes such stale states automatically.
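A minimal sketch of such a proactive check, assuming a periodic cron job on the Torque server with SSH access to the nodes. The spool mount point, the 95% threshold, and the pbs_mom systemd unit name are assumptions; the note-clearing syntax (pbsnodes -N n) follows the TORQUE pbsnodes man page, but verify against the installed version:

#!/bin/sh
# Hypothetical cron job (paths and names are assumptions): for each node
# Torque reports as down, check whether the spool filesystem has free
# space again and, if so, restart pbs_mom and clear the node note.
SPOOL=/var/spool/torque                  # assumed spool mount point

# pbsnodes -l lists nodes in down/offline/unknown state
pbsnodes -l | awk '{print $1}' | while read -r node; do
    # only handle nodes that are down for this particular error
    pbsnodes "$node" | grep -q 'torque spool filesystem full' || continue

    # if the node's spool filesystem is below 95% usage again, recover it
    # (ssh -n keeps ssh from consuming the node list on stdin)
    if ssh -n "$node" "test \$(df --output=pcent $SPOOL | tail -n1 | tr -dc 0-9) -lt 95"; then
        ssh -n "$node" 'systemctl restart pbs_mom'   # unit name may differ
        pbsnodes -N n "$node"                        # 'n' clears the note
    fi
done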

Also, long-term down nodes should be monitored as part of #10.
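As a starting point for that check, a hedged Nagios-style sketch (exit codes assumed; pbsnodes does not report how long a node has been down, so the "long-term" threshold would have to come from the monitoring system's own retry/soft-state settings):

#!/bin/sh
# Hypothetical monitoring check: alert while any node is down/offline.
down=$(pbsnodes -l 2>/dev/null | awk '{print $1}')
if [ -n "$down" ]; then
    echo "CRITICAL: nodes down:" $down
    exit 2
fi
echo "OK: all nodes up"
exit 0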
