ordered cluster shutdown #200

Open
jan--f opened this issue Apr 12, 2017 · 5 comments

@jan--f

jan--f commented Apr 12, 2017

Add an orderly shutdown of all Ceph and gateway services.

@dmbyte

dmbyte commented Apr 12, 2017

Provide notification to any stateful protocols (e.g. CIFS) as we onboard them, and figure out how to block new incoming connections on those protocols during the shutdown timeout period (while waiting for users to log off).
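
One way to block new incoming connections during the drain period would be a temporary firewall rule on the gateway nodes. A minimal sketch, assuming iptables and the usual SMB ports 445/139; this is illustrative only, not DeepSea code, and every other protocol would need its own rule:

```python
# Illustrative sketch only (not DeepSea code): reject *new* SMB/CIFS
# connections while existing sessions drain. Assumes iptables and the
# usual SMB ports 445 and 139; established sessions keep working.
import subprocess

SMB_PORTS = ["445", "139"]

def _rule(action, port):
    return ["iptables", action, "INPUT",
            "-p", "tcp", "--dport", port,
            "-m", "conntrack", "--ctstate", "NEW",
            "-j", "REJECT"]

def fence_smb():
    """Insert REJECT rules for new SMB connections."""
    for port in SMB_PORTS:
        subprocess.check_call(_rule("-I", port))

def unfence_smb():
    """Remove the rules again, e.g. if the shutdown is aborted."""
    for port in SMB_PORTS:
        subprocess.check_call(_rule("-D", port))
```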

@jan--f

jan--f commented Apr 13, 2017

A shutdown needs to go through a number of stages:

  1. Fence off all client protocols. That means preventing new sessions from being opened and ideally draining existing sessions, i.e. making sure existing sessions do not open new files or block ranges. For CephFS, for example, this means that cached (and potentially dirty) data is cleaned up properly.

  2. After a timeout, evict all remaining client connections.

  3. Stop all gateway nodes that are not a dependency for other services, e.g. nfs-ganesha, samba, rgw, igw.

  4. Stop all leftover gateway services. MDS, ...?

  5. Stop OSDs and MONs.

  6. Shut down the systems.

Step 1 is the biggest hurdle here, since Ceph itself doesn't offer any help. There is client eviction for CephFS, but so far no cleanup is done (the feature is tracked here: http://tracker.ceph.com/issues/9754). Considering the number of gateway protocols, this will be hard to do cleanly.

The other steps should be straightforward after this.

So how do we fence off all client gateways? CephFS might have a way soon; for Samba, ganesha, rgw, rbd and igw I don't know.
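
For reference, steps 3 to 6 on their own are mostly a matter of stopping systemd units in the right order across the right minions. A minimal sketch using Salt's LocalClient, assuming typical DeepSea role names and the usual unit names (nfs-ganesha, smb, ceph-radosgw.target, ceph-mds.target, ceph-osd.target, ceph-mon.target); this is not actual DeepSea code:

```python
# Rough sketch of steps 3-6 (not DeepSea code): stop services in dependency
# order via Salt, then optionally power the nodes off. The role targets and
# unit names below are assumptions about a typical DeepSea/SES deployment.
import salt.client

# (compound target, systemd units) in shutdown order
STAGES = [
    ("I@roles:ganesha or I@roles:rgw or I@roles:igw",
     ["nfs-ganesha", "ceph-radosgw.target", "smb"]),   # step 3: gateways
    ("I@roles:mds",     ["ceph-mds.target"]),          # step 4: leftover services
    ("I@roles:storage", ["ceph-osd.target"]),          # step 5: OSDs ...
    ("I@roles:mon",     ["ceph-mon.target"]),          # ... then MONs
]

def ordered_shutdown(poweroff=False):
    local = salt.client.LocalClient()
    for target, units in STAGES:
        for unit in units:
            # service.stop is a standard Salt execution module function;
            # older Salt releases spell the tgt_type keyword as expr_form
            local.cmd(target, "service.stop", [unit], tgt_type="compound")
    if poweroff:
        # step 6: shut the systems down
        local.cmd("*", "system.poweroff")
```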

@swiftgist

Considering that each service will likely have its own challenges in preventing client connections and terminating clients gracefully, is the first step to return a list of existing connections to the admin? That could serve as a validation check that prevents the remaining steps unless the admin specifically overrides and forces the shutdown. I would think this is in the same spirit as shutting down a host. The admin ultimately decides whether to visit and stop each remote connection or hold the power button.

If the default behavior of this orchestration returns a list of active connections with enough information about their sources, I believe the admin may have other options for disconnecting clients. Otherwise, the shutdown proceeds when there are no active connections.

Eventually, this would evolve into every service preventing/terminating clients gracefully, but would you think this is sufficient until then?
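
A minimal sketch of that default behaviour, for CephFS only (the other protocols would need their own collectors), assuming it runs on the MDS host where the admin socket and `ceph daemon mds.<name> session ls` are available; not actual DeepSea code:

```python
# Illustrative sketch (not DeepSea code): report active CephFS client
# sessions and abort the shutdown unless the admin explicitly forces it.
# Assumes access to the MDS admin socket on the local host.
import json
import subprocess

def cephfs_sessions(mds_name):
    """Return the active client sessions of one MDS as a list of dicts."""
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.{}".format(mds_name), "session", "ls"])
    return json.loads(out)

def maybe_shutdown(mds_name, force=False):
    sessions = cephfs_sessions(mds_name)
    if sessions and not force:
        # default behaviour: report the connections and stop here,
        # leaving the decision to the admin
        for s in sessions:
            print("active client: id={} inst={}".format(
                s.get("id"), s.get("inst")))
        return False
    # ... proceed with the ordered shutdown stages ...
    return True
```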

@dmbyte

dmbyte commented Apr 13, 2017 via email

@Martin-Weiss

Is there any progress on this feature request for DeepSea?

FYI - customers that have a UPS and want to shut down the cluster in a controlled way would really like a DeepSea command for doing that...

In the meantime this process is also documented here:
https://www.suse.com/documentation/suse-enterprise-storage-5/singlehtml/book_storage_admin/book_storage_admin.html#sec.salt.cluster.reboot

Could we get a command added that does this using DeepSea?
