Perform a Canary Rollout in an OpenShift 4 Cluster During an Upgrade or Maintenance Window
How does a canary rollout work in OpenShift 4?
Depending on your organization's requirements, you might want to update only a particular set of nodes in your OpenShift cluster, monitor the cluster and workload health over a period of time, and then proceed with updating the remaining nodes. This approach is often known as a canary update.
Using this approach, we can control which nodes get updated first before moving on to the others. For example, if we have a cluster with 100 nodes and only limited maintenance window time, we can still reach our goal by leveraging a canary rollout.
Let's run a simple simulation with 100 nodes: we can divide them into 4 Machine Config Pools (MCPs). Again, the split depends on your requirements.
- workerpool-canary-1 → Node 1–25 (Day 1 Maintenance Window)
- workerpool-canary-2 → Node 26–50 (Day 2 Maintenance Window)
- workerpool-canary-3 → Node 51–75 (Day 3 Maintenance Window)
- workerpool-canary-4 → Node 76–100 (Day 4 Maintenance Window)
We can update a particular set of worker nodes on each day by pausing and unpausing the corresponding MCP at the appropriate time. Now, you may be wondering why we need custom MCPs for a canary rollout at all. Good question :) By default, every worker node is added to the default MCP called "worker", so updates are applied through that pool and the nodes are picked in an arbitrary order. Let's say your application achieves High Availability (HA) by spreading its replicas across worker nodes: you can't control which of those nodes go down during the upgrade, and that can lead to downtime for your application.
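As a quick sketch of that day-by-day flow (the pool names workerpool-canary-1 through -4 come from the example split above and are placeholders for your own naming):
$ # Day 1: unpause the first canary pool so its nodes pick up the pending update
$ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/workerpool-canary-1
$ # ...monitor cluster and workload health during the Day 1 window...
$ # Pause the pool again once its nodes are updated and healthy
$ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/workerpool-canary-1
$ # Repeat with workerpool-canary-2, -3, and -4 on the following days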
What is a Machine Config Pool (MCP)?
OpenShift Container Platform 4 is designed around operators. On Red Hat Enterprise Linux CoreOS (RHCOS), the Machine Config Operator (MCO) is responsible for the operating system, overseeing OS updates and configuration changes. The MCO ensures the operating system is managed and keeps the cluster updated and properly configured. With the MCO, we can configure and update components such as systemd, CRI-O/kubelet, the kernel, and NetworkManager on the nodes. To do that, the MCO creates a statically rendered MachineConfig that merges the MachineConfigs that apply to each pool of nodes, and then applies that rendered configuration to every node in the pool.
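If you want to see these rendered MachineConfigs yourself, they are regular API objects; the exact names will differ in your cluster:
$ oc get machineconfig
$ # Show only the rendered (merged) configs the MCO generated per pool
$ oc get machineconfig | grep rendered-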
The Machine Config Pool (MCP) is similar to a RoleBinding object, which links roles to users: the MachineConfigPool links nodes to MachineConfigs, providing the mapping between them. It has two selectors, one that matches MachineConfigs and one that matches nodes.
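To look at those two selectors on the default worker pool, a minimal check with jsonpath (the exact label values may vary slightly per cluster):
$ oc get mcp worker -o jsonpath='{.spec.machineConfigSelector}{"\n"}'
$ oc get mcp worker -o jsonpath='{.spec.nodeSelector}{"\n"}'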
Hands-on: Implementing a Canary Rollout
This is going to be a simple story, but I hope it helps your journey.
Don't forget to grab your coffee before starting :)
1. Observe the current worker nodes and cluster version
$ oc get node
$ oc get clusterversion
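If you are preparing for a cluster version upgrade, it can also help to check the current channel and the available update versions at this point:
$ oc adm upgrade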
2. Get the current Machine Config Pools (MCPs)
$ oc get mcp
$ oc get node | grep wrk | awk '{print $1}'
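The grep on "wrk" assumes your worker node names contain that string; filtering by the worker role label is a more portable alternative:
$ oc get nodes -l node-role.kubernetes.io/worker -o name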
3. Create a custom MCP
Define the custom MCP based on a role in the YAML file below.
Here, we define a role named worker-canary, so every worker node that carries the worker-canary label will be added to this custom MCP.
$ vi worker-canary.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {
          key: machineconfiguration.openshift.io/role,
          operator: In,
          values: [worker, worker-canary],
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""
  maxUnavailable: 1
  paused: true
$ oc create -f worker-canary.yaml
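After creating it, it's worth confirming that the pool exists and that the MCO has generated a rendered config for it (the rendered name below is illustrative):
$ oc get mcp worker-canary
$ oc get machineconfig | grep rendered-worker-canary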
4. Add worker nodes to the custom MCP
Now, we can add our worker nodes to the custom MCP by adding the role label to the respective nodes as needed.
$ oc label node <node-name> node-role.kubernetes.io/worker-canary=
$ oc label node <node-name> node-role.kubernetes.io/worker-canary=
$ oc label node <node-name> node-role.kubernetes.io/worker-canary=
$ oc label node <node-name> node-role.kubernetes.io/worker-canary=
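If you have many nodes to move into the canary pool, a small shell loop can save some typing. This is just a sketch; it reuses the "wrk" naming pattern from earlier, so adjust the grep to match only the nodes that belong to this batch:
$ for node in $(oc get node | grep wrk | awk '{print $1}'); do oc label node "$node" node-role.kubernetes.io/worker-canary=; done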
Once we label the nodes, they are automatically added to the MCP.
$ oc get node
NAME                  STATUS   ROLES                  AGE   VERSION
<master-node-name>    Ready    master                 12d   v1.25.16+054f0ba
<master-node-name>    Ready    master                 12d   v1.25.16+054f0ba
<master-node-name>    Ready    master                 12d   v1.25.16+054f0ba
<worker-node-name1>   Ready    worker                 12d   v1.25.16+054f0ba
<worker-node-name2>   Ready    worker                 12d   v1.25.16+054f0ba
<worker-node-name3>   Ready    worker,worker-canary   12d   v1.25.16+054f0ba
As you can see, the last worker node now has the additional worker-canary role. Let's verify the current MCPs.
$ oc get mcp
NAME            CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master          rendered-master-31c428c571eb3asdfvdf1c1e3746e64a119            True      False      False      3              3                   3                     0                      559d
worker          rendered-worker-b074bbadd7da8f8wewqaew4821c3c3a07809           True      False      False      2              2                   2                     0                      480d
worker-canary   rendered-worker-canary-b074bbadd7da8f82ad482sdadslk83a07809    True      False      False      1              0                   0                     0                      559d
Looking at the MCP output, the READYMACHINECOUNT for our new MCP is still zero because we set paused: true in the YAML file, so the node hasn't synced to the new pool's rendered config yet. We can now unpause our custom MCP, but make sure there is no update pending in the cluster, so that the node only syncs to the pool and doesn't pick up any other update.
$ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker-canary
<wait a few minutes to sync>
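While waiting, you can watch the pool's status converge; the -w flag streams updates until you interrupt it:
$ oc get mcp worker-canary -w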
Once it's done, it's better to pause the MCP again to avoid any automatic updates on the nodes. That way we keep control over when the worker nodes are updated.
$ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker-canary
5. Verify the MCP
If all the nodes show as updated and there is nothing DEGRADED in the node list, then when we later continue with the cluster update, these worker nodes won't be updated automatically.
$ oc get mcp
NAME            CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master          rendered-master-31c428c571eb3asdfvdf1c1e3746e64a119            True      False      False      3              3                   3                     0                      559d
worker          rendered-worker-b074bbadd7da8f8wewqaew4821c3c3a07809           True      False      False      2              2                   2                     0                      480d
worker-canary   rendered-worker-canary-b074bbadd7da8f82ad482sdadslk83a07809    True      False      False      1              1                   1                     0                      559d
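To double-check that the node really picked up the pool's rendered config, you can read the annotations the MCO sets on the node (the node name is the placeholder from above; output will differ per cluster):
$ oc get node <worker-node-name3> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
$ oc get node <worker-node-name3> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'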
You can continue creating additional MCPs based on your requirements.
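For completeness, a rough sketch of how this fits into the actual upgrade (the target version is a placeholder): start the cluster update while the custom worker pools are paused; the control plane updates right away, and each paused pool holds its nodes back until you unpause it during its maintenance window.
$ oc adm upgrade --to=<target-version>
$ oc get clusterversion
$ oc get mcp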
Thank you :)
References:
https://www.redhat.com/en/blog/openshift-container-platform-4-how-does-machine-config-pool-work