Replace Unhealthy ETCD Member in OpenShift 4.10

Ari Sukarno · Feb 8, 2024


OpenShift Architecture

etcd (pronounced et-see-dee) is an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or clusters of machines. etcd helps to facilitate safer automatic updates, coordinates work being scheduled to hosts, and assists in the setup of overlay networking for containers.

If all etcd members are lost or unhealthy, we won’t be able to make changes to the current Kubernetes state. No new pods will be scheduled, among many other problems. So it’s important to keep etcd in a healthy state.
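
A quick way to get an overview of etcd health is to check the etcd cluster operator and the pods in the openshift-etcd namespace (a minimal sketch, assuming cluster-admin access; the app=etcd label is an assumption about how the etcd static pods are labeled):

; check the etcd cluster operator
# oc get clusteroperator etcd

; list the etcd pods
# oc get pods -n openshift-etcd -l app=etcd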

So, how do we deal with an etcd member that becomes unhealthy? We have to replace the unhealthy etcd member before it leads to more complex problems. As per Red Hat, these are the reasons an etcd member can end up in an unhealthy state [1]:

  • The machine is not running or the node is not ready
  • The etcd pod is crashlooping

How to solve the issue depends on the reason, and each case has a different procedure. In my case, etcd became unhealthy because the etcd pod was crashlooping while the master node was still in Ready state.
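
We can confirm that the node itself is fine by listing the master nodes (a quick check; node-role.kubernetes.io/master is assumed to be the default master role label):

# oc get nodes -l node-role.kubernetes.io/master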

Master Node State

And the etcd pod isn’t running properly
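
Listing the pods in the openshift-etcd namespace should show the problem (a quick check; pod names will follow the node names in your cluster):

# oc get pods -n openshift-etcd -o wide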

etcd-quorum pod

If we check the logs, they say that the pod is unhealthy and its probe has failed
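
To see the failing probe, we can describe the pod and check its logs (a sketch; problematic-pod is a placeholder, and you may need -c to pick a specific container):

; the events at the bottom show the failing probe
# oc describe pod -n openshift-etcd problematic-pod

; container logs
# oc logs -n openshift-etcd problematic-pod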

etcd-quorum logs

Let’s look at the health of the etcd members

# oc rsh -n openshift-etcd problematic-pod
# etcdctl endpoint health -w table
etcd-member state

It shows that one of the etcd members is in a false (unhealthy) state, so we need to resolve this by replacing the problematic etcd member.

Pre-requisites:

  • Verify the pod is crashing/not running properly
  • User has cluster-admin access
  • Take an etcd backup (a sketch follows this list)
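
For the backup, a minimal sketch following the documented OpenShift procedure (healthy-master-node and the backup path are placeholders, adjust them to your environment):

# oc debug node/healthy-master-node
# chroot /host
# /usr/local/bin/cluster-backup.sh /home/core/assets/backup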

Let’s do it! Don’t forget to grab your coffee :)

1. Stop the crashing etcd pod by moving its static pod manifest and the /var/lib/etcd/ data directory
# oc debug node/node-master-problematic
# chroot /host
# mkdir /var/lib/etcd-backup
# mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup/
# mv /var/lib/etcd/ /tmp
stop the pod by moving the etcd file
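
Once the kubelet picks up the removed manifest, the etcd static pod on that node should disappear. A quick check (node-master-problematic is the same placeholder as above):

# oc get pods -n openshift-etcd -o wide | grep node-master-problematic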

2. Remove the unhealthy member

Connect to one of the remaining healthy etcd pods and remove the unhealthy member

# oc rsh -n openshift-etcd pod-name
# etcdctl member list -w table
# etcdctl member remove etcd-member-id
remove unhealthy etcd member
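
After the removal, listing the members again from the same pod should show only the remaining healthy members:

# etcdctl member list -w table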

3. Turn off the quorum guard by running the command below

# oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'

This command ensures that we can re-create the secrets and roll out the new pods later
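
Before moving on, we can check that the override has been applied (a quick check; the jsonpath output should show the flag set to true):

# oc get etcd cluster -o jsonpath='{.spec.unsupportedConfigOverrides}{"\n"}'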

4. Remove the old secrets for the unhealthy etcd member that was removed

# oc get secrets -n openshift-etcd | grep master-node
Node Secrets

Remove all of those secrets using the commands below

; delete etcd-peer secret 
# oc delete secret -n openshift-etcd etcd-peer-node-name

; delete etcd-serving
# oc delete secret -n openshift-etcd etcd-serving-node-name

; delete etcd-serving-metric
# oc delete secret -n openshift-etcd etcd-serving-metrics-node-name
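
Equivalently, the three secrets can be removed in one go (a sketch, keeping the same node-name placeholder used above):

# for s in etcd-peer etcd-serving etcd-serving-metrics; do oc delete secret -n openshift-etcd ${s}-node-name; done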

5. Force etcd redeployment

By running the command below, the secrets and the etcd pods will be redeployed

# oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge 
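
We can follow the rollout while it happens (a quick check; the app=etcd label is assumed to select the etcd static pods):

# oc get pods -n openshift-etcd -l app=etcd -w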

6. Turn the quorum guard back on

# oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'

We can verify that unsupportedConfigOverrides has been removed from the etcd cluster object

# oc get etcd/cluster -oyaml

7. Verify the etcd pods

Wait for a few minutes because all the etcd pods will be redeployed.

# oc get pod -n openshift-etcd
quorum pod

Log in to one of the etcd pods and verify that all etcd members are in a healthy state

# etcdctl endpoint health
etcd member already healthy
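
Optionally, list the members again from inside one of the etcd pods to confirm the replacement member has been added and started:

# etcdctl member list -w table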

We can verify from the cluster operator status as well; it should not be in a Degraded state. In this case, all the etcd members are back in a healthy state and our etcd is safe again. Remember, every situation is different, so make sure you are facing the same condition before following the steps above :D… Thank you

Reference:

[1] https://docs.openshift.com/container-platform/4.10/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-crashlooping-etcd-member_replacing-unhealthy-etcd-member
