Recover from node failure

This topic describes how to recover from OBServer node failures by using ob-operator.

note

In OceanBase Database versions earlier than V4.2.3.0, the database kernel cannot communicate over virtual IP addresses. If the IP address of a Pod changes, the observer process cannot restart normally. Therefore, to restart the observer in place after a failure, you must fix the IP address of the node. Otherwise, the cluster can only rely on the majority of replicas: ob-operator adds a new Pod to the cluster as a new node and synchronizes data to it to restore the original number of nodes.

Based on the multi-replica capability of OceanBase Database

Prerequisites

  • To successfully recover the OceanBase cluster, you must deploy a cluster with at least three OBServer nodes and create tenants with at least three replicas. A minimal cluster definition is sketched after this list.
  • This method can only handle the failure of a minority of nodes, such as the failure of one node in a three-node cluster.
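
For reference, the following is a minimal sketch of a three-zone OBCluster that satisfies these prerequisites. The image tag, storage classes, and secret name are placeholders; verify the fields against the CRD version installed by your ob-operator release.

apiVersion: oceanbase.oceanbase.com/v1alpha1
kind: OBCluster
metadata:
  name: test
  namespace: oceanbase
spec:
  clusterName: obcluster
  clusterId: 1
  userSecrets:
    root: root-password          # placeholder secret name
  topology:                      # three zones, so a minority failure is tolerable
    - zone: zone1
      replica: 1
    - zone: zone2
      replica: 1
    - zone: zone3
      replica: 1
  observer:
    image: oceanbase/oceanbase-cloud-native:<version>   # placeholder image tag
    resource:
      cpu: 2
      memory: 10Gi
    storage:
      dataStorage:
        storageClass: local-path   # placeholder storage class
        size: 50Gi
      redoLogStorage:
        storageClass: local-path
        size: 50Gi
      logStorage:
        storageClass: local-path
        size: 20Gi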

Recovery policy

When an OBServer node fails, ob-operator detects the pod exception and creates a new OBServer node to join the cluster. The new node synchronizes the data of the failed node from the surviving replicas until all data is restored.

During the recovery process, if a majority of OBServer nodes fail, the cluster cannot be restored. In this case, you must manually restore the cluster by using the backup and restore feature of ob-operator.
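
You can observe this recovery live. For example, watching the pods in the cluster namespace shows the abnormal pod being replaced and the new one becoming ready:

kubectl get pods -n oceanbase -w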

Based on the Calico network plugin

Prerequisites

  • To use the static IP address feature, you must install Calico as the network plugin for the Kubernetes cluster.

Recovery policy

When a minority of OBServer nodes fail, the multi-replica mechanism of OceanBase Database ensures the availability of the cluster, and ob-operator detects a pod exception. Then, ob-operator creates an OBServer node, adds it to the cluster, and deletes the abnormal OBServer node. OceanBase Database replicates the data of replicas on the abnormal OBServer node to the new node.

If you have installed Calico for the Kubernetes cluster, this process is simpler: ob-operator can start the new observer process with the IP address of the abnormal OBServer node. This way, if the data on the abnormal node still exists, it can be used directly without the replication step. Moreover, even if a majority of OBServer nodes fail, this method can restore the service once all the new OBServer nodes are started.
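
ob-operator pins the Pod IP address internally, so no manual action is needed. For illustration only, Calico allows an IP address to be fixed through a Pod annotation; cni.projectcalico.org/ipAddrs is Calico's documented annotation, and the address and pod name below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: obcluster-1-zone1-example          # hypothetical pod name
  annotations:
    cni.projectcalico.org/ipAddrs: '["10.244.1.10"]'   # placeholder fixed IP
spec:
  # ...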

Based on Kubernetes service

Prerequisites

  • Only OceanBase Database V4.2.3.0 and later supports this feature.
  • Create an OBCluster in service mode. The configuration is as follows:
apiVersion: oceanbase.oceanbase.com/v1alpha1
kind: OBCluster
metadata:
  name: test
  namespace: oceanbase
  annotations:
    oceanbase.oceanbase.com/mode: service # This is the key configuration
spec:
  # ...

Recovery policy

After you create an OBCluster in service mode, ob-operator attaches a service to each OBServer pod and uses the ClusterIP of the service as the communication address of the node.
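
Conceptually, each per-pod service looks roughly like the following sketch. The name and label selector are hypothetical, since the actual objects are created and managed by ob-operator; 2881 and 2882 are the default SQL and RPC ports of the observer.

apiVersion: v1
kind: Service
metadata:
  name: obcluster-1-zone1-example   # hypothetical name; real services are managed by ob-operator
  namespace: oceanbase
spec:
  type: ClusterIP                   # the ClusterIP stays constant across pod restarts
  selector:
    pod-name: obcluster-1-zone1-example   # hypothetical label; ob-operator sets its own labels
  ports:
    - name: sql
      port: 2881                    # default observer SQL port
    - name: rpc
      port: 2882                    # default observer RPC port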

  1. When an OBServer pod restarts, the observer can restart in place by using the constant ClusterIP of the service for communication.
  2. When an OBServer pod is mistakenly deleted, ob-operator will create a new OBServer pod and use the same ClusterIP for communication. The new node will automatically join the OceanBase cluster and resume service.
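
You can list the per-pod services and their ClusterIPs to confirm the addresses that the observers use for communication:

kubectl get svc -n oceanbase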

Verification

You can verify the recovery result of ob-operator by performing the following steps.

  1. Delete the pod:
kubectl delete pod obcluster-1-zone1-074bda77c272 -n oceanbase
  2. View the recovery result. The output shows that the pod of zone1 has been re-created and is ready.
kubectl get pods -n oceanbase

NAME                             READY   STATUS    RESTARTS   AGE
obcluster-1-zone3-074bda77c272   2/2     Running   0          12d
obcluster-1-zone2-7ecbd89f84de   2/2     Running   0          12d
obcluster-1-zone1-94ecf05cb290   2/2     Running   0          1m

Deployment suggestions

To deploy a production cluster with high availability, we recommend that you deploy an OceanBase cluster with at least three OBServer nodes, create tenants with at least three replicas each, distribute the nodes of each zone across different servers, and use the Calico network plugin (or set the OBCluster to service mode in ob-operator 2.2.0 and later). This minimizes the risk of an unrecoverable cluster disaster. ob-operator also provides high-availability solutions based on the backup and restore of tenant data and on primary and standby tenants. For more information, see Restore data from a backup and Back up a tenant.
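
To complement the cluster definition shown earlier, the following is a minimal sketch of a tenant with three Full replicas, one per zone. The field names follow the OBTenant CRD shipped with ob-operator as we understand it; verify them against the CRD version in your environment, and treat the names and resource values as placeholders.

apiVersion: oceanbase.oceanbase.com/v1alpha1
kind: OBTenant
metadata:
  name: tenant-ha             # placeholder tenant resource name
  namespace: oceanbase
spec:
  obcluster: test             # the OBCluster defined earlier
  tenantName: tenant_ha       # placeholder tenant name
  unitNum: 1
  pools:                      # one Full replica in each of the three zones
    - zone: zone1
      type:
        name: Full
        replica: 1
        isActive: true
      resource:
        maxCPU: 1             # placeholder resource values
        memorySize: 5Gi
    - zone: zone2
      type:
        name: Full
        replica: 1
        isActive: true
      resource:
        maxCPU: 1
        memorySize: 5Gi
    - zone: zone3
      type:
        name: Full
        replica: 1
        isActive: true
      resource:
        maxCPU: 1
        memorySize: 5Gi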