docs: added docs/devops.md
lni committed Jun 18, 2019
1 parent 8be738f commit a832193
Showing 5 changed files with 26 additions and 13 deletions.
2 changes: 1 addition & 1 deletion README.CHS.md
@@ -107,7 +107,7 @@ go build -v -tags="dragonboat_no_rocksdb" pkgname

## Documents and Resources ##

We recommend reading the project's [overview](docs/overview.CHS.md) first.
We recommend reading the project's [overview](docs/overview.CHS.md) and [DevOps notes](docs/devops.CHS.md) first.

See also the [godoc documentation](https://godoc.org/github.com/lni/dragonboat), [Chinese examples](https://github.com/lni/dragonboat-example), [FAQ](https://github.com/lni/dragonboat/wiki/FAQ), [CHANGELOG](CHANGELOG.md) and the online [discussion group](https://gitter.im/lni/dragonboat).

2 changes: 1 addition & 1 deletion README.md
@@ -106,7 +106,7 @@ go build -v -tags="dragonboat_no_rocksdb" pkgname
```

### Documents ###
[FAQ](https://github.com/lni/dragonboat/wiki/FAQ), [docs](https://godoc.org/github.com/lni/dragonboat), step-by-step [examples](https://github.com/lni/dragonboat-example), [CHANGELOG](CHANGELOG.md) and [online chat](https://gitter.im/lni/dragonboat) are available.
[FAQ](https://github.com/lni/dragonboat/wiki/FAQ), [docs](https://godoc.org/github.com/lni/dragonboat), step-by-step [examples](https://github.com/lni/dragonboat-example), [DevOps doc](docs/devops.md), [CHANGELOG](CHANGELOG.md) and [online chat](https://gitter.im/lni/dragonboat) are available.

C++ Binding info can be found [here](https://github.com/lni/dragonboat/blob/master/binding/README.md).

12 changes: 12 additions & 0 deletions docs/devops.CHS.md
@@ -0,0 +1,12 @@
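# DevOps Notes #

This document describes operational considerations for applications built on Dragonboat once they are deployed in production. Note that incorrect operational practices can directly cause permanent data corruption.

* ext4 is the recommended filesystem; other filesystems have not been tested.
* Local disks are required. Enterprise-grade NVMe SSDs with high write endurance are recommended. Avoid any form of network shared storage such as NFS, CIFS, Samba or Ceph.
* Never back up or restore data generated by Dragonboat by directly copying or overwriting its files or directories. Doing so permanently corrupts the Raft groups involved.
* Each Raft group already has multiple replicas. Increasing the number of replicas is the best way to avoid service unavailability and data loss caused by partial node failures. For example, 5 replicas tolerate up to 2 simultaneous node failures, giving stronger data safety and availability guarantees than 3 replicas.
* After individual nodes fail, if a quorum still exists and the Raft group remains available, first add an Observer node so that it starts syncing the Raft group state immediately, promote it to a regular node as soon as it has caught up, and then remove the failed node through a membership change. For intermittent hardware failures that can be recovered right away, such as a brief network partition or power loss, try to repair the affected machine immediately.
* On disk damage, such as checksum errors on disk data, first rule out a software fault, then immediately stop the node and replace the disk, and replace the permanently failed node using the membership change procedure described above. If a node confirmed as permanently failed must be restarted, make sure all Dragonboat data has been completely cleared by replacing the disk.
* In extreme cases, when a majority of nodes fail permanently at the same time and cannot be repaired, the Raft group becomes unavailable. The ImportSnapshot tool provided by the github.com/lni/dragonboat/tools package must then be used to repair the damaged Raft group. This requires users to regularly export and back up snapshots with NodeHost's ExportSnapshot method for disaster recovery.
* Users should test for themselves whether their system is highly available, and whether it handles failures of different numbers and combinations of nodes correctly while keeping Raft groups available. Disaster recovery drills should be part of routine CI.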
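
The node replacement order described in the list above (add an Observer, wait for it to catch up, promote it, then remove the failed node) can be sketched as follows. This is only an illustration of the ordering: the `Membership` interface, its method names, `ReplaceFailedNode`, `fakeGroup` and the address `host4:63000` are all hypothetical and are not Dragonboat's API. Map each step to the membership change calls of the Dragonboat version you run.

```go
package main

import (
	"errors"
	"fmt"
)

// Membership is a hypothetical abstraction over membership change calls.
// It is NOT Dragonboat's API; it exists only to illustrate the ordering.
type Membership interface {
	AddObserver(nodeID uint64, addr string) error
	InSync(nodeID uint64) (bool, error)
	Promote(nodeID uint64, addr string) error
	Remove(nodeID uint64) error
}

// ReplaceFailedNode adds the new node as an observer, polls until it has
// caught up, promotes it, and only then removes the failed node.
// A real implementation would sleep between polls; that is omitted here.
func ReplaceFailedNode(m Membership, failedID, newID uint64, newAddr string, maxPolls int) error {
	if err := m.AddObserver(newID, newAddr); err != nil {
		return fmt.Errorf("add observer: %w", err)
	}
	synced := false
	for i := 0; i < maxPolls && !synced; i++ {
		ok, err := m.InSync(newID)
		if err != nil {
			return fmt.Errorf("check sync: %w", err)
		}
		synced = ok
	}
	if !synced {
		// Stop here so the membership is not changed any further.
		return errors.New("observer did not catch up; failed node left in place")
	}
	if err := m.Promote(newID, newAddr); err != nil {
		return fmt.Errorf("promote: %w", err)
	}
	if err := m.Remove(failedID); err != nil {
		return fmt.Errorf("remove failed node: %w", err)
	}
	return nil
}

// fakeGroup is an in-memory stand-in used only to exercise the ordering.
type fakeGroup struct {
	pollsUntilSync int      // number of InSync calls before reporting true
	log            []string // membership changes in the order they were issued
}

func (f *fakeGroup) AddObserver(nodeID uint64, addr string) error {
	f.log = append(f.log, "add-observer")
	return nil
}

func (f *fakeGroup) InSync(nodeID uint64) (bool, error) {
	f.pollsUntilSync--
	return f.pollsUntilSync <= 0, nil
}

func (f *fakeGroup) Promote(nodeID uint64, addr string) error {
	f.log = append(f.log, "promote")
	return nil
}

func (f *fakeGroup) Remove(nodeID uint64) error {
	f.log = append(f.log, "remove")
	return nil
}

func main() {
	g := &fakeGroup{pollsUntilSync: 3}
	err := ReplaceFailedNode(g, 2, 4, "host4:63000", 10)
	fmt.Println(g.log, err)
}
```

Removing the failed node last means the group never has fewer voting members than it had at the moment of the failure, which is presumably why the list recommends this order.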
12 changes: 12 additions & 0 deletions docs/devops.md
@@ -0,0 +1,12 @@
# DevOps #

This document describes the DevOps requirements for operating Dragonboat-based applications in production. Note that incorrect operations can permanently corrupt your Raft clusters.

* Use the ext4 filesystem. Other filesystems have never been tested.
* Use local disks only. Enterprise NVMe SSDs with a high write endurance rating are recommended. Avoid NFS, CIFS, Samba, Ceph or any similar shared storage.
* Never try to back up or restore Dragonboat data by operating directly on Dragonboat data files or directories. Doing so can immediately corrupt your Raft clusters.
* Each Raft group has multiple replicas. The best way to safeguard the availability of your services and data is to increase the number of replicas. For example, a Raft group can tolerate 2 node failures with 5 replicas, but only 1 node failure with 3 replicas.
* After a node failure, the Raft group remains available as long as it still has a quorum. To handle such a failure, add an Observer node so that data starts replicating to it. Once it is in sync with the other replicas, promote the Observer to a regular node and remove the failed node using the membership change APIs. For nodes that failed because of intermittent problems such as a short-term network partition or power loss, resolve the network or power issue and try restarting the affected nodes.
* On disk failure, such as data integrity check errors or write failures, immediately replace the failed disk and remove the failed node using the membership change method described above.
* When a quorum of nodes is permanently lost, the Raft group cannot be recovered without losing data. The github.com/lni/dragonboat/tools package provides the ImportSnapshot method, which imports a previously exported snapshot to repair such a failed Raft cluster.
* Always test your system to ensure that it is highly available by design. Disaster recovery should always be a part of the CI.
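
The replica counts in the list above follow from Raft's majority rule: a group stays available only while a strict majority of its replicas is reachable. A minimal sketch of that arithmetic, using the hypothetical helper names `Quorum` and `FaultTolerance` (they are not part of Dragonboat):

```go
package main

import "fmt"

// Quorum returns the number of replicas that must be reachable for a
// Raft group of the given size to make progress (a strict majority).
func Quorum(replicas int) int {
	return replicas/2 + 1
}

// FaultTolerance returns how many replicas can fail while a strict
// majority of the group is still reachable.
func FaultTolerance(replicas int) int {
	if replicas <= 0 {
		return 0
	}
	return replicas - Quorum(replicas)
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n",
			n, Quorum(n), FaultTolerance(n))
	}
}
```

Going from 3 to 4 replicas raises the quorum to 3 without tolerating any additional failure, which is why odd replica counts such as 3 and 5 are the usual choices.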
11 changes: 0 additions & 11 deletions docs/overview.CHS.md
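@@ -133,14 +133,3 @@ Dragonboat provides the following other commonly used features through NodeHost:
Leader transfer. Normally the leader is chosen by election, in a way that is transparent to the user application. Users can call the RequestLeaderTransfer method provided by NodeHost to try to transfer leadership to a specified node.

NodeHost also provides the GetNodeHostInfo and GetClusterMembership methods for querying information about the Raft groups managed by each NodeHost.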

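## DevOps Notes ##

* ext4 is the recommended filesystem; other filesystems have not been tested.
* Local disks are required. Enterprise-grade NVMe SSDs with high write endurance are recommended. Avoid any form of network shared storage such as NFS, CIFS, Samba or Ceph.
* Never back up or restore data generated by Dragonboat by directly copying or overwriting its files or directories. Doing so permanently corrupts the Raft groups involved.
* Each Raft group already has multiple replicas. Increasing the number of replicas is the best way to avoid service unavailability and data loss caused by partial node failures. For example, 5 replicas tolerate up to 2 simultaneous node failures, giving stronger data safety and availability guarantees than 3 replicas.
* After individual nodes fail, provided a quorum still exists and the Raft group remains available, first add an Observer node so that it starts syncing the Raft group state immediately, promote it to a regular node as soon as it has caught up, and then remove the failed node through a membership change. For intermittent hardware failures that can be recovered right away, such as a brief network partition or power loss, try to repair the affected machine immediately.
* On disk damage, such as checksum errors on disk data, first rule out a software fault, then immediately stop the node and replace the disk, and replace the permanently failed node through a membership change. Restart the machine only after all permanently failed nodes have been removed, and make sure by then that Dragonboat data has been completely cleared by replacing the disk.
* In extreme cases, when a majority of nodes fail permanently at the same time and cannot be repaired, the Raft group becomes unavailable. The ImportSnapshot tool provided by the github.com/lni/dragonboat/tools package must then be used to repair the damaged Raft group. This requires users to regularly export and back up snapshots with NodeHost's ExportSnapshot method for disaster recovery. This measure is needed only in extreme cases; in our own use of Dragonboat over the past 18 months we have never needed to repair a Raft group through this lossy procedure.
* Users should test for themselves whether their system is highly available, and whether it handles failures of different numbers and combinations of nodes correctly while keeping Raft groups available. Disaster recovery drills should be part of routine CI.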
