
ovn-central Pods fail to run properly

oilbeater edited this page Jun 27, 2022 · 2 revisions

The Chinese documentation in this wiki is no longer maintained. Please visit our latest Chinese documentation site for up-to-date docs.

In versions before v1.0, OVN's Raft mode has an issue where the cluster cannot recover after a machine restart; upgrading to a post-1.0 version is recommended. Fixing this on a pre-1.0 version requires restarting all Pods on the container network, and even after the fix a machine restart may still trigger the issue.

Check /var/log/openvswitch/ovsdb-server-nb.log and /var/log/openvswitch/ovsdb-server-sb.log on the affected machine. ERR logs similar to the following confirm that the problem is caused by the OVN Raft implementation:

```
2020-05-15T02:52:55.703Z|00335|raft|ERR|Dropped 15 log messages in last 13 seconds (most recently, 2 seconds ago) due to excessive rate
2020-05-15T02:52:55.703Z|00336|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53161 last_log_index=61188 last_log_term=52219
2020-05-15T02:53:06.803Z|00337|raft|ERR|Dropped 15 log messages in last 11 seconds (most recently, 2 seconds ago) due to excessive rate
2020-05-15T02:53:06.803Z|00338|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53169 last_log_index=61188 last_log_term=52219
2020-05-15T02:53:18.409Z|00339|raft|ERR|Dropped 13 log messages in last 12 seconds (most recently, 2 seconds ago) due to excessive rate
2020-05-15T02:53:18.409Z|00340|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53176 last_log_index=61188 last_log_term=52219
2020-05-15T02:53:30.920Z|00341|raft|ERR|Dropped 15 log messages in last 12 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-15T02:53:30.920Z|00342|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53184 last_log_index=61188 last_log_term=52219
```
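A quick way to scan both logs for this signature, assuming the default log paths above:

```shell
# Look for the Raft "deferred vote_request" error in both OVN DB server logs.
# Adjust the paths if your installation logs elsewhere.
pattern='deferred vote_request message completed but not ready to send'
for f in /var/log/openvswitch/ovsdb-server-nb.log \
         /var/log/openvswitch/ovsdb-server-sb.log; do
  if [ -f "$f" ] && grep -qF "$pattern" "$f"; then
    echo "Raft vote_request errors found in $f"
  fi
done
```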

Fix procedure:

1. Record the current replica counts and host machines of ovn-central and kube-ovn-controller, then stop both:

```shell
kubectl scale deployment -n kube-ovn --replicas=0 ovn-central
kubectl scale deployment -n kube-ovn --replicas=0 kube-ovn-controller
```
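Before scaling down, the current replica counts can be captured for the later scale-up steps; a minimal sketch assuming the default kube-ovn namespace (the `save_replicas` helper name is hypothetical):

```shell
# Capture the current replica counts so they can be restored when scaling
# back up later. Assumes the deployments live in the kube-ovn namespace.
save_replicas() {
  kubectl get deployment -n kube-ovn "$1" -o jsonpath='{.spec.replicas}'
}
echo "ovn-central replicas: $(save_replicas ovn-central)"
echo "kube-ovn-controller replicas: $(save_replicas kube-ovn-controller)"
```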
2. On every machine that hosted ovn-central, delete the DB files under /etc/origin/openvswitch:

```shell
rm -rf /etc/origin/openvswitch/ovnnb_db.db
rm -rf /etc/origin/openvswitch/ovnsb_db.db
```
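A more cautious variant of this step moves the database files aside instead of deleting them, so they can be restored if the rebuild fails:

```shell
# Move the OVN DB files aside rather than deleting them outright.
for db in ovnnb_db.db ovnsb_db.db; do
  if [ -f "/etc/origin/openvswitch/$db" ]; then
    mv "/etc/origin/openvswitch/$db" "/etc/origin/openvswitch/$db.bak"
  fi
done
```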
3. Back up and delete the metis webhook so that Pods can still be created:

```shell
kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io metis -o yaml > metis.yaml
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io metis
```
4. Scale ovn-central back to its previous replica count and wait for all Pods to become Ready:

```shell
kubectl scale deployment -n kube-ovn ovn-central --replicas=<XXX>
```
5. Scale kube-ovn-controller back to its previous replica count and wait for all Pods to become Ready:

```shell
kubectl scale deployment -n kube-ovn kube-ovn-controller --replicas=<XXX>
```
6. Delete all Pods on the container network so they are recreated:

```shell
for ns in $(kubectl get ns --no-headers -o custom-columns=NAME:.metadata.name); do
  for pod in $(kubectl get pod --no-headers -n "$ns" --field-selector spec.restartPolicy=Always -o custom-columns=NAME:.metadata.name,HOST:spec.hostNetwork | awk '{if ($2!="true") print $1}'); do
    kubectl delete pod "$pod" -n "$ns"
  done
done
```
7. Recreate the metis webhook: remove fields such as creationTimestamp, resourceVersion, selfLink, and uid from metis.yaml, then re-apply it:

```shell
kubectl apply -f metis.yaml
```
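The field cleanup in the last step can be sketched with sed, assuming each server-populated key sits on its own line (a YAML-aware tool is safer for complex documents); the `strip_server_fields` helper and the sample file below are hypothetical:

```shell
# Sketch: drop server-populated metadata fields from the saved webhook dump.
strip_server_fields() {
  sed -E '/^[[:space:]]*(creationTimestamp|resourceVersion|selfLink|uid):/d' "$1"
}

# Demonstration on a small hypothetical sample:
cat > /tmp/metis-sample.yaml <<'EOF'
metadata:
  name: metis
  creationTimestamp: "2020-05-15T00:00:00Z"
  resourceVersion: "12345"
  uid: 00000000-0000-0000-0000-000000000000
EOF
strip_server_fields /tmp/metis-sample.yaml
# prints only the "metadata:" and "  name: metis" lines
```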