Overview

We have been updating OceanBase Advanced Tutorial for DBAs for some time, and now it's time to talk about troubleshooting.

Some relevant documents are already available for your reference, such as the blog Quick Fixes for OceanBase Issues written by an OceanBase engineer and the Emergency response chapter in the official documentation of OceanBase Database.

However, the scenarios and solutions provided in these documents are not systematic or intuitive enough for users of OceanBase Database Community Edition.

Additionally, considering suggestions from community users such as @oceanvoice and @Zhang Yuqi, we will provide a more systematic and comprehensive troubleshooting manual in this advanced tutorial.

The manual will summarize issues that may occur when you use OceanBase Database and provide corresponding solutions. While these issues may not be common, they can have a serious impact and require database administrators (DBAs) to conduct preliminary analysis and take immediate corrective measures. The contents are likely to be as follows:

Overview
Slow response
High CPU load
Node failures
Hardware and infrastructure exceptions
- Network jitter
- Disk issues
Exceptions caused by load changes
- Full log/data disk
...

The following figure shows the troubleshooting mind map of OceanBase Database.

What's More

This topic provides only an overview of the troubleshooting manual and does not contain much in-depth content, for which we apologize. To make up for this, we provide the following common method for quick recovery in the event of serious OceanBase faults, especially when services are interrupted. Briefly speaking, you can take the following steps:

If only one tenant is faulty in a cluster, execute the ALTER TENANT [SET] PRIMARY_ZONE statement to change the primary zone of the tenant for leader switchover.
If only one node is faulty in a cluster, stop or isolate the node.
If all nodes are faulty in a cluster, restart the nodes one by one.
If issues persist after all nodes are restarted in the cluster, perform a failover to switch the standby cluster to the primary role.
Always analyze faults only after they have been handled and the service has been recovered.

Coming Soon

If you encounter cluster issues that are not covered in this manual, you need to contact engineers on duty in the OceanBase community forum to obtain technical support in most cases. As O&M staff, you do not need to further understand the issues. Therefore, these issues are assigned lower priorities. After the advanced tutorial is completed, we will add relevant content in the "Official Selection" module of the OceanBase community forum.

sys tenant or RootService exceptions
Memory leak
Disk space leak
Long-running transaction
Suspended transaction
Core dump
...

What's More​

Coming Soon​

References​

What's More

Coming Soon

References