Latency numbers of different golang operations

Latency is one of the most important metrics in system performance, and different systems have different latency requirements: the read latency of a relational database may need to be under 50 ms, the GC pause latency of a programming language runtime should be under 10 ms or even 1 ms, and the round-trip latency between two microservices in the same data center may need to be under 0.2 ms. Not every single part of a system is latency sensitive, but in practice many components are, and we must be very careful when we design and implement those components or systems.

Many articles have discussed system latency, both from a high-level, macroscopic perspective, such as the latency of a complex architecture, the latency of system interactions such as HTTP API invocations, database reads and writes, and cache accesses, and the latency of programming-language operations such as memory allocation or function calls; and from the low-level, underlying system, such as the latency of memory access, I/O access, TCP packet transmission, and so on. The following latency table is from the book Systems Performance, and the napkin-math project also provides a table of latency numbers.

Time scale of system latencies

In this article I will focus on latency in the Golang programming language, including APIs in some golang libraries; golang-specific features such as goroutine creation and destruction and channel access; and golang runtime latency such as GC pauses.

The latency numbers table

| Operation type | Golang latency | Benchmark environment | Benchmark source |
| --- | --- | --- | --- |
| Empty function call | 0.4 ns | Depends on concurrency and CPU count; run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | A pure go empty function call |
| RWMutex RLock + RUnlock | 15-40 ns | Depends on concurrency and CPU count; run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | ref: go issue "RWMutex scales poorly with CPU count" |
| Cgo function call | 70 ns | Run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | ref: cgo benchmarks |
| Select on a channel | 10-100 ns (case 1: private channel); 100-700 ns (case 2: shared channel) | Depends on lock contention in runtime.sellock and runtime.selunlock; run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | ref: go issue "runtime: select on a shared channel is slow with many Ps" |
| Thread-safe buffer to simulate a channel | 100 ns (case 1: single writer and reader); 400-500 ns (case 2: single writer, multiple readers); 400-500 ns (case 3: multiple writers and readers) | Run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | ref: "So just how fast are channels anyway" |
| Create a goroutine and call WaitGroup.Done once | 300-800 ns | Run on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, go version 1.16.3 | ref: benchmark code |
| Golang gRPC: unary secure ping-pong | 200-300 us | 8-core systems on GCE, gRPC official benchmark | netperf: 70-80 us; C++: 130-200 us; ref: gRPC testing dashboard and gRPC benchmarking |
| Golang gRPC: streaming secure ping-pong | 150-200 us | 8-core systems on GCE, gRPC official benchmark | netperf: 70-80 us; C++: 100-140 us; ref: gRPC testing dashboard and gRPC benchmarking |
| Go 1.8-1.9 GC, STW pauses per GC | < 2 x 500 us | Golang official benchmark | ref: Getting to Go: The Journey of Go's Garbage Collector |
| Go 1.7-1.8 GC | 1.5 ms | Golang official benchmark | ref: Getting to Go: The Journey of Go's Garbage Collector |
| Go 1.5 GC, STW pauses every 50 ms | 10 ms | Golang official benchmark | ref: Go GC: Prioritizing low latency and simplicity |

Summary

Real-world systems can be far more complicated than the cases above. The latency of a whole system is contributed by many pieces of code and logic, so knowing the latency of each single part is not a silver bullet, but it is the foundation of performance tuning. Besides, we can use profiling and tracing tools to diagnose system performance, such as the CPU profiler and execution tracer shipped with golang pprof. Finally, I will quote some advice about performance tuning given by Dave Cheney in the High Performance Go Workshop at a gopher conference.
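As a concrete illustration of those tools, CPU profiling and execution tracing can be enabled directly from code with the standard runtime/pprof and runtime/trace packages. This is only a sketch: the output file names and the busyWork workload are placeholders of mine, not part of any benchmark above.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

// busyWork is a stand-in for the code under measurement.
func busyWork() int {
	sum := 0
	for i := 0; i < 10_000_000; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	// CPU profile, inspected later with: go tool pprof cpu.out
	cpuOut, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer cpuOut.Close()
	if err := pprof.StartCPUProfile(cpuOut); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Execution trace, inspected later with: go tool trace trace.out
	traceOut, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer traceOut.Close()
	if err := trace.Start(traceOut); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	busyWork()
}
```

For long-running servers, importing net/http/pprof exposes the same profiles over HTTP instead of writing them to local files.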

Start with the simplest possible code.

Measure. Profile your code to identify the bottlenecks, do not guess.

If performance is good, stop. You don’t need to optimise everything, only the hottest parts of your code.

As your application grows, or your traffic pattern evolves, the performance hot spots will change.

Don’t leave complex code that is not performance critical, rewrite it with simpler operations if the bottleneck moves elsewhere.

Always write the simplest code you can, the compiler is optimised for normal code.

Shorter code is faster code; Go is not C++, do not expect the compiler to unravel complicated abstractions.

Shorter code is smaller code; which is important for the CPU’s cache.

Pay very close attention to allocations, avoid unnecessary allocation where possible.
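The last point about allocations can be checked without a full benchmark harness: testing.AllocsPerRun reports the average number of heap allocations a function performs. A small sketch, where the two build functions are hypothetical examples of mine contrasting naive append with preallocation:

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so escape analysis cannot
// optimize the allocations away.
var sink []int

// buildNaive grows a slice from nil, so append must repeatedly
// reallocate and copy the backing array as it doubles.
func buildNaive(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// buildPrealloc sizes the backing array once up front.
func buildPrealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	naive := testing.AllocsPerRun(100, func() { sink = buildNaive(1024) })
	pre := testing.AllocsPerRun(100, func() { sink = buildPrealloc(1024) })
	fmt.Printf("naive: %.0f allocs/op, preallocated: %.0f allocs/op\n", naive, pre)
}
```

In regular benchmarks the same information is available via b.ReportAllocs() or the go test -benchmem flag.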

Reference