
TiCDC is a change data capture framework developed for the TiDB ecosystem. It replicates change data to various downstream systems, including MySQL protocol-compatible databases, message queue systems such as Kafka, and object storage services such as S3. As the top code contributor and the former tech leader of this project, I witnessed it grow from a concept into a mature system running in many production environments, and I led several important architecture evolutions along the way. Now that I am leaving this project to explore new opportunities, I’d like to look back and summarize the lessons I have learned from it.

Highlights of this system

In this part I will talk about the highlights of TiCDC from the perspectives of system architecture, core features, smart design, and some engineering implementations.

A well-designed plug-and-play, monolithic architecture

TiCDC stands out due to its well-designed plug-and-play monolithic architecture. Setting up a synchronization path for replicating row-based change data from an existing TiDB cluster is a breeze. Users only need to deploy one or more TiCDC processes connected to the same TiDB cluster; they form a unified cluster automatically, without any topology configuration, meta-system deployment, or separate components. All compute components, including the data source part (puller), the data sort part (sorter), the data encapsulation part (mounter), and the data sink part (sink), run in the same process. That is what I mean by plug-and-play, and it is a classic monolithic architecture. In addition to its straightforward deployment, TiCDC is designed to handle system maintenance effectively, in particular managing many high-availability scenarios automatically, including node crashes, upstream failures, network partitions, and so on. TiCDC provides high availability at both the process and data levels.
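
To make the single-process layout concrete, here is a hypothetical, much-simplified sketch of how these four stages could be expressed as interfaces inside one process; the type and method names below are illustrative only and do not reflect TiCDC’s actual code.

```go
package pipeline

// RawKVEntry and RowChangedEvent are simplified stand-ins for the data that
// flows between stages: raw key-value changes in, decoded row changes out.
type RawKVEntry struct {
	Key, Value []byte
	CommitTS   uint64
}

type RowChangedEvent struct {
	Table    string
	Columns  map[string]interface{}
	CommitTS uint64
}

// Puller pulls raw change data from the upstream TiDB/TiKV cluster.
type Puller interface {
	Output() <-chan RawKVEntry
}

// Sorter orders raw entries by commit timestamp.
type Sorter interface {
	AddEntry(RawKVEntry)
	Output() <-chan RawKVEntry
}

// Mounter decodes sorted entries into row changed events using the table schema.
type Mounter interface {
	Mount(RawKVEntry) (RowChangedEvent, error)
}

// Sink emits row changed events to the downstream (MySQL, Kafka, S3, ...).
type Sink interface {
	EmitRow(RowChangedEvent) error
}
```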

Read more »

This is a real case from my recent work with the go-mysql client. In this scenario, there exists a MySQL connection pool (which maintains multiple database connections) that keeps writing transactions to a peer database, such as MySQL or TiDB. Each transaction contains several insert, update, or delete DMLs; the DML count ranges from 40 to 128. To reduce round trips between the MySQL client and the MySQL server, we adopt multiple statements in a transaction, which combines multiple DMLs into a single SQL statement. A sample code snippet with go-sql-driver is as follows.
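
The original snippet sits behind the read-more link; the following is only a rough sketch, with a made-up table and DSN, of what batching DMLs into one multi-statement SQL with go-sql-driver/mysql can look like (note the multiStatements=true DSN parameter).

```go
package main

import (
	"database/sql"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

// writeBatch sends all DMLs of one transaction in a single round trip by
// joining them into one multi-statement SQL text.
func writeBatch(db *sql.DB, dmls []string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(strings.Join(dmls, ";")); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}

func main() {
	// multiStatements=true is required for the driver to accept several
	// semicolon-separated statements in one Exec call.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test?multiStatements=true")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	_ = writeBatch(db, []string{
		"INSERT INTO t (id, v) VALUES (1, 'a')",
		"UPDATE t SET v = 'b' WHERE id = 1",
		"DELETE FROM t WHERE id = 1",
	})
}
```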

Read more »

When we diagnose performance issues of a running system, memory is always a must-check item, since it can affect latency, throughput, and jitter in many ways. This article focuses on memory issues in golang programs, and summarizes how to diagnose them with the help of principles and toolchains.
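
The article’s full toolchain walkthrough is behind the link; as a minimal starting point (assuming the process can expose an HTTP port), the standard net/http/pprof package serves heap profiles that go tool pprof can read.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// While the process is running, inspect live heap allocations with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```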

Read more »

Recently I created a new virtual machine with Ubuntu 20.04, but I found that terminal commands with sudo took a long time, almost 3 seconds each time. When I searched for sudo command slow on Google, the first two results were both from Stack Overflow and suggested adding an entry 127.0.0.1 hostname to the /etc/hosts file. I checked /etc/hosts and found the entry 127.0.0.1 hostname already existed (in fact I made a mistake here, which I will describe later). This article describes the investigation into this issue as a timeline.

Read more »

In recent work my team was devoted to a cross-region data replication solution; a simplified abstraction of the workload is as follows.

The gRPC service and the downstream MySQL are located in different AWS regions (for example, one in EU and the other in US East), and the average network latency between these two regions is 70ms. We can deploy the data transfer service on either the gRPC service side or the downstream MySQL side. After a series of tests, we observed that deploying the data transfer service on the downstream (MySQL) side yields much higher throughput than deploying it on the upstream (gRPC service) side. This article analyzes the root cause from both the benchmark results and first principles, tries to find potential solutions, and gives advice for such scenarios.
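
As a rough, illustrative back-of-the-envelope (not the article’s full analysis): if each MySQL transaction requires k cross-region round trips, each costing at least 70ms, then per-connection throughput is capped at roughly 1 / (k × 0.07), i.e. about 14 / k transactions per second, which hints at why keeping the chatty MySQL write path inside one region matters.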

Read more »

The Go runtime provides a convenient way to dump the stack traces of all current goroutines via pprof, which can save developers a lot of time when diagnosing problems such as deadlocks, goroutine pauses (IO wait, blocking channel receive, etc.), and goroutine leaks. In a long-running Go process there can be thousands of goroutines, and it takes some time to find the suspicious stack trace among a large number of lines. This article summarizes some common methods to discover the suspicious stack trace or goroutine quickly.
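
The article’s filtering tricks are behind the link; for reference, a minimal sketch of one common way to capture the dump being discussed is via runtime/pprof, which prints the same text served at /debug/pprof/goroutine?debug=1.

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stack traces of all current goroutines to stdout,
// the same output an HTTP pprof endpoint serves at /debug/pprof/goroutine?debug=1.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}

func main() {
	dumpGoroutines()
}
```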

Read more »

Latency is one of the most important metrics of system performance, and different systems have different latency requirements: the read latency of a relational database may need to be less than 50ms, the GC latency of a programming language runtime should be less than 10ms or even 1ms, while the latency between two microservices in the same data center could be required to be less than 0.2ms. Not every single part of a system is latency sensitive, but there do exist many components that are, and we must be very careful when we design and implement these components or systems.

A lot of articles have talked about system latency, both from the high-level, macroscopic perspective, such as the latency of a complex architecture, the latency of system interactions such as HTTP API invocations, database reads and writes, and cache accesses, and the latency of programming-language operations such as memory allocation or function calls; and from the low-level, underlying-system perspective, such as the latency of memory access, IO access, TCP packet transmission, etc. The following latency table is from the book Systems Performance, and the napkin-math project also provides a table of latency numbers.

Read more »

Recently I met a context deadline exceeded error when using a gRPC client to call DialContext against a gRPC server in a customer’s internal environment. After some investigation I found the root cause: the gRPC client was behind an HTTP proxy, and the proxy had no permission to access the gRPC server. It is a common troubleshooting case, but I am interested in how we can force a program to use an HTTP or SOCKS proxy, so I will dive into it and compare the pros and cons of different solutions.
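
A minimal sketch (with a hypothetical server address) of how the error surfaces: with grpc.WithBlock, DialContext keeps retrying until the connection is ready or the context deadline fires, so a server that is unreachable through the proxy shows up as context deadline exceeded.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With WithBlock, DialContext only returns once the connection is ready;
	// if the proxy cannot reach the server, it fails with "context deadline
	// exceeded" when the 5-second timeout expires.
	conn, err := grpc.DialContext(ctx, "grpc-server.internal:50051", // hypothetical address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```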

Ways to enable a proxy

There are many ways to let an application use a proxy; they can be classified into the three types below, and I will talk about each of them briefly.

  • Set an explicit environment variable to activate the proxy-aware transport feature built into the application (see the sketch after this list).
  • Use a hook to hijack network calls from the application, without changing the program code.
  • Hijack network packets, for example with the kernel netfilter module, to inspect and modify traffic.
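
As an example of the first approach, Go’s standard HTTP transport already honors the HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables through http.ProxyFromEnvironment; a small sketch (with a placeholder URL) looks like this.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// http.ProxyFromEnvironment reads HTTP_PROXY, HTTPS_PROXY and NO_PROXY;
	// the default transport already uses it, so setting the environment
	// variable is enough, but spelling it out shows where the knob lives.
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
	}

	// Run with e.g. HTTPS_PROXY=http://127.0.0.1:8080 to route this request
	// through the proxy.
	resp, err := client.Get("https://example.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```
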
Read more »

It is well known in golang that when storage is allocated for a new variable and no explicit initialization is provided, the variable is given a default value: the zero value for primitive types, or nil for nullable types such as pointers, functions, maps, etc. More details can be found in the Go spec. Some of the nullable types cause a panic when accessed without initialization. Some examples are as follows.
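
The post’s own examples are behind the read-more link; a minimal sketch of the kind of nil-access panics it refers to:

```go
package main

func main() {
	// Writing to a nil map panics with "assignment to entry in nil map";
	// reading from a nil map is fine and returns the element type's zero value.
	var m map[string]int
	_ = m["missing"]
	m["k"] = 1 // panics here, so the lines below would never be reached

	// Calling a nil function value, or dereferencing a nil pointer,
	// likewise panics with a nil pointer dereference.
	var f func()
	f()

	var p *int
	_ = *p
}
```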

Read more »