Skip to main content

Flink CDC + OceanBase integration solution for full and incremental synchronization

Introduction

Change Data Capture (CDC) is a widely applied technology that captures database changes. In this post, we will introduce you to the Flink CDC + OceanBase Database data integration solution. This solution combines CDC with extraordinary pipeline capabilities and diversified ecosystem tools of Flink, to synchronize processed CDC data to the downstream, formulating a solution for integrated full and incremental synchronization based on OceanBase Database Community Edition.

The solution brings two benefits. First, it synchronizes data by using one component and one link. Second, Flink SQL supports aggregation and extract-transform-load (ETL) of database and table shards, making it much easier for users to analyze, process, and synchronize CDC data by executing a Flink SQL job.

Keypoints

This post is based on the content shared by Wang He, an open source tool expert with OceanBase.

It contains the following five parts:

  1. Introduction to the CDC technology

  2. Introduction to OceanBase CDC components

  3. Introduction to Flink CDC

  4. Use Flink CDC OceanBase Connector

  5. Conclusion

I. CDC technology

The CDC technology monitors and captures changes in a database, such as the INSERT, UPDATE, and DELETE operations on the data or data tables, and then writes the changes to message-oriented middleware, so that other services can subscribe to and consume the changes.

Alibaba Canal is a popular open source CDC tool, which is mainly used in Alibaba Cloud open-source components for incremental MySQL data subscription and consumption. The latest version of Alibaba Canal supports OceanBase Database Community Edition data sources, incremental DDL and DML operations, and filtering of databases, tables, and columns. You can use it with ZooKeeper for the deployment of high-availability clusters. The client adapter of Alibaba Canal supports multiple types of containers as the destination. You can use it with Alibaba Otter to achieve active geo-redundancy.

Another popular open source CDC framework is Debezium.

It supports the synchronization of DDL and DML operation logs, uses the primary key or unique key as the key of the message body, and also supports the snapshot mode and full synchronization.

Debezium also supports a variety of data sources. You can integrate Debezium Server into a program as an embedded engine to directly write data to a message system without using Kafka.

II. OceanBase CDC components

OceanBase Database Community Edition provides four CDC components:

obcdc (formerly liboblog): pulls incremental logs in sequence.

oblogmsg: parses the format of incremental logs.

oblogproxy: pulls the incremental logs.

oblogclient: connects to oblogproxy to obtain the incremental logs.

OceanBase Migration Service (OMS) Community Edition is provided. It is an all-in-one data migration tool for incremental data migration, full data migration, and full data verification.

The preceding figure shows the CDC logic of OceanBase Database Community Edition. Data is pulled by oblogproxy and OMS Community Edition. Canal and Flink CDC are integrated with oblogclient to obtain incremental logs from oblogproxy.

III. Flink CDC

Flink CDC supports multiple data sources, such as MySQL, PostgreSQL, and Oracle. Flink CDC reads the full and incremental data from a variety of databases, and then automatically transfers data to the Flink SQL engine for processing.

Flink is a hybrid engine that supports both batch and streaming processing. Flink CDC converts streaming data into a dynamic table. In the preceding figure, the lower left part shows the mapping between streaming data and a dynamic table. The lower right part shows the results of multiple executions of continuous queries.

The preceding figure shows the working principle of Flink CDC. It implements the SourceFunction API based on Debezium and supports MySQL, Oracle, MongoDB, PostgreSQL, and SQLServer.

The latest version of Flink CDC supports data reads from a MySQL data source by using the Source API, which provides enhanced concurrent reading compared to the SourceFunction API.

The OceanBaseRichSourceFunction API is implemented for full and incremental data reads respectively based on JDBC and oblogclient.

=============

IV. Use Flink CDC OceanBase Connector

Configure the docker-compose.yml file and start the container. Go to the directory where the docker-compose.yml file is stored, and run the docker-compose up-d command to start the required components.

Run the docker-compose exec observer obclient-h127.0.0.1-P2881-uroot-ppsw command to log on by using newly created username and password. Download the required dependency packages and execute Flink DDL statements on the CLI of Flink SQL to create a table.

Set the checkpointing interval to 3 seconds and the local time zone to Asia/Shanghai. Then, create an order table, a commodity table, and the associated order data table. Perform data reads and writes.

View the data in Kibana by visiting the following address:

http://localhost:5601/app/kibana#/management/kibana/index_pattern

Create an index pattern named enriched_orders, and then view the written data by visiting http://localhost:5601/app/kibana#/discover.

Modify the data of the monitored table and view the incremental data changes. Perform the following modification operations in OceanBase Database in sequence, and refresh Kibana once after each step. We can see that the order data displayed in Kibana is updated in real time.

Clean up the environment. Go to the directory where the docker-compose.yml file is located, and run the docker-compose down command to stop all containers. Go to the Flink deployment directory and run the ./bin/stop-cluster.sh command to stop the Flink cluster.

V. Conclusion

Flink CDC supports full and incremental data migration between many types of data sources and works with Flink SQL to perform ETL operations on streaming data. As of the release of Flink CDC 2.2, the project has 44 contributors, 4 maintainers, and more than 4,000 community members.

OceanBase Connector can be integrated with Flink CDC 2.2 or later to read full data and incremental DML operations from multiple databases and tables in AT_LEAST_ONCE mode. Flink CDC OceanBase Connector will gradually support concurrent reads, incremental DDL operations, and the EXACTLY_ONCE mode in later versions.

Now, let's briefly compare the existing CDC solutions. OMS Community Edition is a proven online data migration tool with a GUI-based console. It provides full data migration, incremental data migration, data verification, and O&M services. DataX + Canal/Otter is a fully open source solution. Canal supports many types of destinations and incremental DDL operations, and Otter supports active-active disaster recovery.

Afterword

Flink CDC is a fully open source solution and is supported by an active community. It supports full and incremental data synchronization between many types of data sources and destinations. It is worth mentioning that Flink CDC is easy to use, and supports aggregation and ETL of database and table shards. Compared with some existing CDC solutions that involve complex data cleaning, analysis, and aggregation operations, Flink SQL allows users to easily process data for various business needs by using methods such as stream-stream join and dimension table join.

Contact us

Feel free to contact us at any time.

Visit the official forum of OceanBase Database Community Edition

Report an issue of OceanBase Database Community Edition

DingTalk Group ID: 33254054