Torque nodes left in down after error condition disappeared #13

Open
vholer opened this issue Jan 4, 2018 · 0 comments
vholer commented Jan 4, 2018

A node ran out of disk space. Torque switched it to the down state with an explanatory note. After the error condition cleared, the node didn't return to up/free automatically; pbs_mom on the affected node had to be restarted manually to repair the state.

$ pbsnodes
node1.localdomain
     state = down
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=...,sessions=10062 10109 10274,nsessions=3,nusers=1,idletime=1444966,totmem=8010380kb,availmem=7536480kb,physmem=8010380kb,ncpus=8,loadave=0.00,message=ERROR: torque spool filesystem full,gres=,netload=15776691386,state=free,varattr= ,cpuclock=Fixed,version=6.1.1.1,rectime=1515096744,jobs=
     note = ERROR: torque spool filesystem full
     mom_service_port = 15002
     mom_manager_port = 15003

Consider a proactive operation that fixes such stale states automatically.
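A minimal sketch of such a proactive check, assuming a periodic cron job on the Torque server with SSH access to the nodes. The spool mount point, the 95% threshold, and the pbs_mom systemd unit name are assumptions; the note-clearing syntax (pbsnodes -N n) follows the TORQUE pbsnodes man page, but verify against the installed version:

#!/bin/sh
# Hypothetical cron job (paths and names are assumptions): for each node
# Torque reports as down, check whether the spool filesystem has free
# space again and, if so, restart pbs_mom and clear the node note.
SPOOL=/var/spool/torque                  # assumed spool mount point

# pbsnodes -l lists nodes in down/offline/unknown state
pbsnodes -l | awk '{print $1}' | while read -r node; do
    # only handle nodes that are down for this particular error
    pbsnodes "$node" | grep -q 'torque spool filesystem full' || continue

    # if the node's spool filesystem is below 95% usage again, recover it
    # (ssh -n keeps ssh from consuming the node list on stdin)
    if ssh -n "$node" "test \$(df --output=pcent $SPOOL | tail -n1 | tr -dc 0-9) -lt 95"; then
        ssh -n "$node" 'systemctl restart pbs_mom'   # unit name may differ
        pbsnodes -N n "$node"                        # 'n' clears the note
    fi
done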

Also, long-term down nodes should be monitored as part of #10.
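As a starting point for that check, a hedged Nagios-style sketch (exit codes assumed; pbsnodes does not report how long a node has been down, so the "long-term" threshold would have to come from the monitoring system's own retry/soft-state settings):

#!/bin/sh
# Hypothetical monitoring check: alert while any node is down/offline.
down=$(pbsnodes -l 2>/dev/null | awk '{print $1}')
if [ -n "$down" ]; then
    echo "CRITICAL: nodes down:" $down
    exit 2
fi
echo "OK: all nodes up"
exit 0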
