This article first walks through an incorrect use of etcd lease. Based on that case, I then dig into the design principles of etcd lease and some of its usage scenarios.
There are two roles in the following scenario: the client and the coordinator. More than one client can exist at the same time, and each of them has a unique ID. When a new client starts, it requests a new lease from etcd and puts a key corresponding to its client ID with the lease attached as an option. The client must call KeepAlive on its lease periodically to keep the lease from timing out. Once the lease has timed out and been deleted by the etcd server, the client becomes illegal and should not access etcd resources anymore. The coordinator watches the client ID keys. When a new client registers, the coordinator allocates new resources/tasks to it, which can be represented by a key-value pair related to the client ID. When the coordinator detects that a client ID key has been deleted (which means the client's lease has timed out), it reclaims the resources allocated to that client.
Here we use an etcd session to maintain the client lease and its keepalive. A simple working model of the client is as follows:
session, err := concurrency.NewSession(etcdCli, concurrency.WithTTL(10))
In the above code, session.Done is used to check whether the lease is still alive. However, this is not a strict liveness guarantee for resource access, for two reasons.
- After the client receives a new update from the watch channel, its lease could time out while the resource update is still being processed, which means the client could still access the resource after it has become illegal.
- session.Done for a lease is not triggered in real time: when the lease times out and is revoked by the etcd server, the session.Done channel may not fire immediately. This is because session.Done is only notified after the etcd client sends a new keepalive request, so there can be a time window as long as 1/3 of the session TTL during which session.Done is not notified.
A strict liveness guarantee for resource access can be achieved simply by using an etcd Txn. Since there is a key bound to the client lease, the client can use this key to guarantee that the lease has not timed out while it performs other etcd key-value operations. The main logic can be as follows.
ErrLeaseTimeout := errors.New("lease associated key is deleted")
In this part I will talk about the implementation principles of etcd lease, based on the code at tag 3.4.14. Basically, each etcd server runs a lease manager which implements the Lessor interface. Most lease management operations go through raft to keep lease information consistent among multiple etcd servers. Take the lease grant operation as an example. When a LeaseGrantRequest is received by an etcd server, the gRPC request is processed in LeaseGrant of the lease server, which returns a LeaseGrantResponse after processing. While processing the LeaseGrantRequest, the request is passed to the LeaseGrant function of EtcdServer/Lessor to trigger an internal raft request. The raft message is then applied to all servers via the internal raft mechanism. When the LeaseGrant message is applied in each etcd server, the Grant function of the Lessor is finally called.
The main event loop of a Lessor contains two periodic jobs, revokeExpiredLeases and checkpointScheduledLeases, both of which run every 500ms.
- revokeExpiredLeases finds all leases past their expiry and sends them to an expired channel for revoking; the channel is consumed in the etcd server's main loop. Each lease is associated with a LeaseItem, and all lease items are stored in a min heap sorted by lease expiration time. While reading the code that iterates the expiration heap, I found an interesting snippet: each time the lessor pops an expired item from the heap, it puts back a new lease item with the same lease ID but with expiredLeaseRetryInterval added to the expiry time. This logic was patched in to fix a bug: if the receiver of the expired channel failed to revoke the lease, the lease would never be revoked, because it could no longer be retrieved from the lease expiration heap. More details can be found in this PR.
- checkpointScheduledLeases was introduced in etcd 3.4 in this PR, which describes the requirement and mechanism of lease checkpointing in detail. It is designed for the scenario where etcd leadership is transferred: the new leader rebuilds lease information and inherits the remaining TTL of existing leases instead of automatically renewing them to their full TTL.
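The pop-and-requeue behaviour of the expiration heap can be illustrated with a small, self-contained sketch. It is not the etcd implementation: the names leaseItem, leaseQueue, popExpired, at and newDemoQueue are hypothetical, the 3-second retryInterval is an assumed value, and the sketch re-queues relative to the sweep time as a simplification.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// leaseItem pairs a lease ID with its expiry time (hypothetical type for this sketch).
type leaseItem struct {
	id     int64
	expiry time.Time
}

// leaseQueue is a min-heap of lease items ordered by expiry time.
type leaseQueue []leaseItem

func (q leaseQueue) Len() int            { return len(q) }
func (q leaseQueue) Less(i, j int) bool  { return q[i].expiry.Before(q[j].expiry) }
func (q leaseQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaseQueue) Push(x interface{}) { *q = append(*q, x.(leaseItem)) }
func (q *leaseQueue) Pop() interface{} {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

// retryInterval is an assumed value chosen for this sketch.
const retryInterval = 3 * time.Second

// popExpired returns the IDs of all leases whose expiry is not after now.
// Each expired item is pushed back with a later expiry, so a lease whose
// revocation fails is found again by a later sweep instead of being lost.
func popExpired(q *leaseQueue, now time.Time) []int64 {
	var expired []int64
	for q.Len() > 0 && !(*q)[0].expiry.After(now) {
		it := heap.Pop(q).(leaseItem)
		expired = append(expired, it.id)
		it.expiry = now.Add(retryInterval)
		heap.Push(q, it)
	}
	return expired
}

// at returns a fixed base time plus sec seconds, to keep the demo deterministic.
func at(sec int) time.Time {
	base := time.Date(2021, 1, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

// newDemoQueue builds a heap with three leases expiring at 1s, 5s and 2s.
func newDemoQueue() *leaseQueue {
	q := &leaseQueue{{id: 1, expiry: at(1)}, {id: 2, expiry: at(5)}, {id: 3, expiry: at(2)}}
	heap.Init(q)
	return q
}

func main() {
	q := newDemoQueue()
	fmt.Println("sweep at 3s:", popExpired(q, at(3)))
	fmt.Println("items still queued:", q.Len())
	fmt.Println("sweep at 6s:", popExpired(q, at(6)))
}
```

In the sketch, an expired lease leaves the heap only if something else removes it after a successful revoke; until then every retry interval gives the revoker another chance.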
In short, etcd lease has second-level precision, which is reflected in two aspects:
- When a lease is granted, the time unit of the TTL is seconds. In addition, etcd enforces a minimum TTL.
- Since the etcd server determines which leases have timed out lazily, instead of using a more precise notification mechanism, lease timeout is detected with some latency. This means that when we grant a new lease with TTL = N seconds and send no keepalive request for it, the lease will be revoked by the etcd server within roughly [N, N + delta] seconds, where delta is generally 0.5 but can be larger once the time cost of other logic is taken into account. As an example, the sample code grants a new lease with TTL = 5s every 50ms, 20 leases in total. For each lease it attaches a key, sends a keepalive request to the etcd server to refresh the lease, then watches for the deletion of the key and records how long the lease took to time out. In the test results, the time until each lease is revoked falls within [5s, 5.6s], which is as expected.
What’s more, the etcd server has a hard-coded limit on lease revocation: in each round of expired-lease revoking, at most 500 leases can be revoked. This can be easily verified by the code snippet. In this scenario lease expiry has additional latency; a test result is as follows:
In most cases, making a large number of keys expire at the same time is not a good design. And when we use etcd lease, we must be aware of the lazy expiration mechanism.
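To get a feel for the effect of the per-round cap, the rough model below estimates how many rounds are needed to revoke a batch of leases that all expire together. It only uses the two figures quoted above (a 500ms loop period and at most 500 revocations per round); the function names are hypothetical, and the model ignores raft and network time, so treat the result as a lower bound rather than a measurement.

```go
package main

import (
	"fmt"
	"time"
)

const (
	roundInterval  = 500 * time.Millisecond // loop period quoted in the article
	revokePerRound = 500                    // per-round cap quoted in the article
)

// revokeRounds returns how many rounds are needed to revoke n leases
// that expired at the same moment.
func revokeRounds(n int) int {
	if n <= 0 {
		return 0
	}
	return (n + revokePerRound - 1) / revokePerRound // ceiling division
}

// extraRevokeDelay estimates how much later the last batch is picked up
// compared with the first one, under this simplified model.
func extraRevokeDelay(n int) time.Duration {
	rounds := revokeRounds(n)
	if rounds <= 1 {
		return 0
	}
	return time.Duration(rounds-1) * roundInterval
}

func main() {
	for _, n := range []int{100, 500, 501, 2000, 10000} {
		fmt.Printf("%5d leases: %2d rounds, last batch delayed by at least %v\n",
			n, revokeRounds(n), extraRevokeDelay(n))
	}
}
```

Under these assumptions, 2000 leases expiring together need four rounds, so the last batch is picked up at least 1.5s after the first.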
Operating systems provide both a “wall clock”, which is subject to changes for clock synchronization, and a “monotonic clock”, which is not. The general rule is that the wall clock is for telling time and the monotonic clock is for measuring time. Is etcd lease reliable if the system’s wall clock is updated by an NTP service? The answer is yes: on both the etcd server side and the etcd client side, the lease implementation is reliable because the monotonic clock is used. Since Go 1.9, the standard time package has built-in monotonic clock support, and etcd makes use of this feature to ensure that time comparisons are safe.
- On the server side, both the code that sets the expiry time of a lease and the code that checks for expiry use monotonic time.
- On the client side, the Time.Before() API is used to check whether a keepalive request should be sent, which is also tolerant of wall clock changes.
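The Go behaviour that both sides rely on can be checked with a short standalone program. This is a sketch of the standard time package only, not etcd code; hasMonotonic and demo are hypothetical helpers, and hasMonotonic relies on the documented fact that Time.String appends an "m=±value" field when a monotonic reading is present.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t carries a monotonic clock reading.
// Time.String includes an "m=" field only when such a reading exists.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// demo returns whether three differently derived times carry a monotonic reading.
func demo() (now, stripped, deadline bool) {
	start := time.Now()                // wall clock + monotonic reading
	wallOnly := start.Round(0)         // Round(0) strips the monotonic reading
	expiry := start.Add(5 * time.Second) // Add keeps the monotonic reading
	return hasMonotonic(start), hasMonotonic(wallOnly), hasMonotonic(expiry)
}

func main() {
	now, stripped, deadline := demo()
	fmt.Println("time.Now() has monotonic reading:", now)
	fmt.Println("after Round(0):", stripped)
	fmt.Println("after Add(5s):", deadline)

	// When both operands carry monotonic readings, Before, After and Sub
	// use them and ignore the wall clock, so an NTP adjustment between the
	// two calls does not affect the comparison.
	expiry := time.Now().Add(5 * time.Second)
	fmt.Println("not yet expired:", time.Now().Before(expiry))
}
```

A deadline computed as time.Now().Add(ttl) and later compared against time.Now() therefore measures elapsed time rather than wall clock time, as long as neither value has had its monotonic reading stripped (for example by serialization or Round(0)).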
Etcd lease is powerful but has some restrictions. It is better to know its underlying principles, which helps to use it correctly and reasonably.