ordered cluster shutdown #200

Open
jan--f opened this issue Apr 12, 2017 · 5 comments

@jan--f

jan--f commented Apr 12, 2017

Add an orderly shutdown of all Ceph and gateway services.

@dmbyte

dmbyte commented Apr 12, 2017

Provide notification to any stateful protocols (e.g. CIFS) as we onboard them, and figure out how to block new incoming connections on those protocols during the shutdown timeout period (while waiting for users to log off).
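
One way to block new incoming connections during the drain period would be a temporary firewall rule on the gateway nodes. A minimal sketch, assuming iptables and the usual SMB ports 445/139; this is illustrative only, not DeepSea code, and every other protocol would need its own rule:

```python
# Illustrative sketch only (not DeepSea code): reject *new* SMB/CIFS
# connections while existing sessions drain. Assumes iptables and the
# usual SMB ports 445 and 139; established sessions keep working.
import subprocess

SMB_PORTS = ["445", "139"]

def _rule(action, port):
    return ["iptables", action, "INPUT",
            "-p", "tcp", "--dport", port,
            "-m", "conntrack", "--ctstate", "NEW",
            "-j", "REJECT"]

def fence_smb():
    """Insert REJECT rules for new SMB connections."""
    for port in SMB_PORTS:
        subprocess.check_call(_rule("-I", port))

def unfence_smb():
    """Remove the rules again, e.g. if the shutdown is aborted."""
    for port in SMB_PORTS:
        subprocess.check_call(_rule("-D", port))
```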

@jan--f

jan--f commented Apr 13, 2017

A shutdown needs to go through a number of stages:

  1. Fence off all client protocols. That means preventing new sessions from being opened and ideally draining existing sessions, i.e. making sure existing sessions do not open new files or block ranges. For CephFS, for example, this means that cached (and potentially dirty) data is cleaned up properly.

  2. After a timeout, evict all remaining client connections.

  3. Stop all gateway nodes that are not a dependency for other services, e.g. nfs-ganesha, samba, rgw, igw.

  4. Stop all leftover gateway services. MDS, ...?

  5. Stop OSDs and MONs.

  6. Shut down the systems.

Step 1 is the biggest hurdle here, since Ceph itself doesn't offer any help. There is client eviction for CephFS, but so far no cleanup is done (the feature is tracked here: http://tracker.ceph.com/issues/9754). Considering the number of gateway protocols, this will be hard to do cleanly.

The other steps should be straightforward after this.

So how do we fence off all client gateways? CephFS might have a way soon; for Samba, ganesha, rgw, rbd and igw I don't know.
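
For reference, steps 3 to 6 on their own are mostly a matter of stopping systemd units in the right order across the right minions. A minimal sketch using Salt's LocalClient, assuming typical DeepSea role names and the usual unit names (nfs-ganesha, smb, ceph-radosgw.target, ceph-mds.target, ceph-osd.target, ceph-mon.target); this is not actual DeepSea code:

```python
# Rough sketch of steps 3-6 (not DeepSea code): stop services in dependency
# order via Salt, then optionally power the nodes off. The role targets and
# unit names below are assumptions about a typical DeepSea/SES deployment.
import salt.client

# (compound target, systemd units) in shutdown order
STAGES = [
    ("I@roles:ganesha or I@roles:rgw or I@roles:igw",
     ["nfs-ganesha", "ceph-radosgw.target", "smb"]),   # step 3: gateways
    ("I@roles:mds",     ["ceph-mds.target"]),          # step 4: leftover services
    ("I@roles:storage", ["ceph-osd.target"]),          # step 5: OSDs ...
    ("I@roles:mon",     ["ceph-mon.target"]),          # ... then MONs
]

def ordered_shutdown(poweroff=False):
    local = salt.client.LocalClient()
    for target, units in STAGES:
        for unit in units:
            # service.stop is a standard Salt execution module function;
            # older Salt releases spell the tgt_type keyword as expr_form
            local.cmd(target, "service.stop", [unit], tgt_type="compound")
    if poweroff:
        # step 6: shut the systems down
        local.cmd("*", "system.poweroff")
```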

@swiftgist

Considering that each service will likely have its own challenges in preventing client connections and terminating clients gracefully, is the first step to return a list of existing connections to the admin? That could serve as a validation check that prevents the remaining steps unless the admin specifically overrides and forces the shutdown. I would think this is in the same spirit as shutting down a host. The admin ultimately decides whether to visit and stop each remote connection or hold the power button.

If the default behavior of this orchestration returns a list of active connections with enough information about their sources, I believe the admin may have other options for disconnecting clients. Otherwise, the shutdown proceeds when there are no active connections.

Eventually, this would evolve into every service preventing/terminating clients gracefully, but would you think this is sufficient until then?
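
A minimal sketch of that default behaviour, for CephFS only (the other protocols would need their own collectors), assuming it runs on the MDS host where the admin socket and `ceph daemon mds.<name> session ls` are available; not actual DeepSea code:

```python
# Illustrative sketch (not DeepSea code): report active CephFS client
# sessions and abort the shutdown unless the admin explicitly forces it.
# Assumes access to the MDS admin socket on the local host.
import json
import subprocess

def cephfs_sessions(mds_name):
    """Return the active client sessions of one MDS as a list of dicts."""
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.{}".format(mds_name), "session", "ls"])
    return json.loads(out)

def maybe_shutdown(mds_name, force=False):
    sessions = cephfs_sessions(mds_name)
    if sessions and not force:
        # default behaviour: report the connections and stop here,
        # leaving the decision to the admin
        for s in sessions:
            print("active client: id={} inst={}".format(
                s.get("id"), s.get("inst")))
        return False
    # ... proceed with the ordered shutdown stages ...
    return True
```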

@dmbyte

dmbyte commented Apr 13, 2017 via email

@Martin-Weiss

Is there any progress on this feature request for DeepSea?

FYI - customers that have a UPS and want to shut down the cluster in a controlled way would really like a DeepSea command for doing that...

In the meantime this process is also documented here:
https://www.suse.com/documentation/suse-enterprise-storage-5/singlehtml/book_storage_admin/book_storage_admin.html#sec.salt.cluster.reboot

Could we get a command added that does this using DeepSea?
