Best practices for goroutine inspection in Go

The Go runtime provides a convenient way to dump the stack traces of all current goroutines via pprof, which can save developers a lot of time when diagnosing problems such as deadlocks, paused goroutines (IO wait, blocking channel receive, etc.), and goroutine leaks. A long-running Go process may contain thousands of goroutines, so it takes time to find the suspicious stack trace among a large number of lines. This article summarizes some common methods to quickly discover the suspicious stack traces and goroutines.
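As a starting point, the dump endpoints can be exposed by importing net/http/pprof and starting an HTTP server. A minimal sketch follows; the listen address is an arbitrary choice, and the same setup is used by the example program later in this article:

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose the pprof endpoints; the port is arbitrary for this sketch.
    go func() {
        http.ListenAndServe("0.0.0.0:9200", nil)
    }()

    // ... application code ...
    select {} // block forever so the process keeps running
}

The grouped dump is then available at /debug/pprof/goroutine?debug=1 on that port, and the full per-goroutine dump at /debug/pprof/goroutine?debug=2.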

Group the goroutines

When dumping goroutines via /debug/pprof/goroutine?debug=1, the goroutine information is grouped by stack trace signature: goroutines with the same stack trace are listed in the same group, as in the following example:

3 @ 0x43b705 0x46f298 0x46f26e 0x47c8b9 0xa428fd 0x473521
# 0x46f26d sync.runtime_notifyListWait+0xcd runtime/sema.go:513
# 0x47c8b8 sync.(*Cond).Wait+0x98 sync/cond.go:56
# 0xa428fc google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x35c google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:1373

However, the debug=1 dump omits some critical information, including the goroutine state, the parameter signature of each function call, and the goroutine ID. The debug=2 mode carries more information, but the stack traces are listed one by one. A sample debug=2 goroutine dump used here contains 4724 goroutines. It is convenient to use goroutine-inspect to analyze such a dump; after loading the sample dump file, it groups the goroutines by state, as follows:

# of goroutines: 4724

IO wait: 25
chan receive: 590
runnable: 79
running: 1
select: 3249
semacquire: 711
sync.Cond.Wait: 68
syscall: 1

goroutine-inspect provides many ways to analyze goroutines, such as using dedup to fold goroutines with the same stack trace, which groups them the same way debug=1 does in pprof. It also supports querying goroutines via govaluate expression evaluation; the supported query patterns can be found in Properties of a Goroutine Dump Item.
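Besides fetching dumps over HTTP, the same debug=1/debug=2 output can be captured in-process with the runtime/pprof package, which is handy for saving periodic snapshots for later analysis with goroutine-inspect. A minimal sketch (dumpGoroutines is a hypothetical helper name):

package main

import (
    "os"
    "runtime/pprof"
)

// dumpGoroutines writes the goroutine profile to a file.
// debug=1 groups goroutines with identical stack traces;
// debug=2 prints every goroutine with its state, ID and full stack.
func dumpGoroutines(path string, debug int) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return pprof.Lookup("goroutine").WriteTo(f, debug)
}

func main() {
    if err := dumpGoroutines("goroutines.log", 2); err != nil {
        panic(err)
    }
}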

Understand the information in a stack trace

A stack trace contains not only the call stack of packages and functions, but also the function parameter signatures, including memory addresses and structure information (such as length and capacity). A more detailed analysis can be found in the article Stack Traces In Go. Apart from the stack trace and parameter signatures, the goroutine state is the most important information for locating stuck goroutines, such as network IO blocking, channel receive blocking, etc.
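As a small illustration, in a debug=2 dump the frame of a blocked function includes the raw words of its arguments. How those words map to parameters depends on the Go version and calling convention (with the register-based ABI in newer releases the printed values can be less literal), so the frame shown in the comment below is only indicative. A contrived sketch:

package main

import (
    "net/http"
    _ "net/http/pprof"
)

// blockWithArgs never returns, so its frame stays visible in the dump.
// In a pre-1.17 style trace the frame looks roughly like
//   main.blockWithArgs(0xc00005e060, 0x3, 0x5)
// where the hex words are the channel pointer and the two int arguments.
//
//go:noinline
func blockWithArgs(ch chan int, a, b int) {
    ch <- a + b // blocks forever: there is no receiver
}

func main() {
    // Keep the pprof endpoint available so a dump can be taken while blocked.
    go http.ListenAndServe("0.0.0.0:9200", nil)
    blockWithArgs(make(chan int), 3, 5)
}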

Understand the goroutine state

In fact, the state listed in a pprof goroutine dump combines the runtime G status and, for a blocked G, its waiting reason. The G statuses are defined in the Go runtime as follows:

  • idle: goroutine was just allocated and has not yet been initialized.
  • runnable: goroutine is on a run queue. It is not currently executing user code. The stack is not owned.
  • running: goroutine may execute user code. The stack is owned by this goroutine. It is not on a run queue. It is assigned an M and a P (g.m and g.m.p are valid).
  • syscall: goroutine is executing a system call. It is not executing user code. The stack is owned by this goroutine. It is not on a run queue. It is assigned an M.
  • waiting: goroutine is blocked in the runtime.
  • moribund_unused: currently unused, but hardcoded in gdb scripts.
  • dead: goroutine is currently unused. It may be just exited, on a free list, or just being initialized.
  • enqueue_unused: currently unused.
  • copystack: goroutine’s stack is being moved.
  • preempted: goroutine stopped itself for a suspendG preemption.

The wait reason is the reason a goroutine is in the waiting status; wait reasons are also defined in the Go runtime. The most common patterns observed in a goroutine dump are the G statuses (runnable, running, syscall) and many kinds of wait reasons, and the wait reasons are often the key information for locating problems.
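The following contrived sketch parks goroutines in several common waiting states, so the corresponding dump headers (such as [chan receive], [select], [sync.Cond.Wait], [sleep]) can be reproduced and inspected; the listen address is an arbitrary choice:

package main

import (
    "net/http"
    _ "net/http/pprof"
    "sync"
    "time"
)

func main() {
    // pprof endpoint for taking the dumps.
    go http.ListenAndServe("0.0.0.0:9200", nil)

    // goroutine N [chan receive]:
    go func() { <-make(chan struct{}) }()

    // goroutine N [select]:
    go func() {
        a, b := make(chan int), make(chan int)
        select {
        case <-a:
        case <-b:
        }
    }()

    // goroutine N [sync.Cond.Wait]:
    go func() {
        var mu sync.Mutex
        c := sync.NewCond(&mu)
        mu.Lock()
        c.Wait()
    }()

    // goroutine N [sleep]:
    go func() { time.Sleep(24 * time.Hour) }()

    // goroutine 1 [select (no cases)]: keep main alive forever.
    select {}
}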

Waiting time

When a goroutine has been blocked for a long time, the goroutine dump records the blocking duration, as in the first line goroutine 1 [IO wait, 11 minutes]: of the following dump. This stack trace was triggered by a MySQL database ping operation going through a TCP proxy that delays the response to the program, which can be treated as a network IO blocking case.

A network IO blocking goroutine dump

goroutine 1 [IO wait, 11 minutes]:
internal/poll.runtime_pollWait(0x261bfb8, 0x72, 0xffffffffffffffff)
runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000148698, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc000148680, 0xc000162000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc000148680, 0xc000162000, 0x1000, 0x1000, 0x230b100, 0x20300000000000, 0x24fffff)
net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0000100a0, 0xc000162000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
net/net.go:183 +0x91
github.com/go-sql-driver/mysql.(*buffer).fill(0xc0001126c0, 0x4, 0x2049f60, 0x2049f60)
github.com/go-sql-driver/mysql@v1.6.0/buffer.go:90 +0x142
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc0001126c0, 0x4, 0x2049f60, 0x0, 0x203000, 0xc00010f830, 0x0)
github.com/go-sql-driver/mysql@v1.6.0/buffer.go:119 +0x9c
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc0001126c0, 0x2049f60, 0xc00010f948, 0x100f09b, 0x2040108, 0x133bd59)
github.com/go-sql-driver/mysql@v1.6.0/packets.go:32 +0x91
github.com/go-sql-driver/mysql.(*mysqlConn).readHandshakePacket(0xc0001126c0, 0x1000, 0x1000, 0xc000162000, 0x0, 0x135d292, 0xf, 0x13b6d48)
github.com/go-sql-driver/mysql@v1.6.0/packets.go:188 +0x45
github.com/go-sql-driver/mysql.(*connector).Connect(0xc000010098, 0x13b51d0, 0xc00001a0c0, 0x0, 0x0, 0x0, 0x0)
github.com/go-sql-driver/mysql@v1.6.0/connector.go:81 +0x44b
database/sql.(*DB).conn(0xc000079450, 0x13b51d0, 0xc00001a0c0, 0x13b2601, 0xc000010098, 0x0, 0x0)
database/sql/sql.go:1301 +0x184
database/sql.(*DB).PingContext(0xc000079450, 0x13b51d0, 0xc00001a0c0, 0x2e, 0xc000079450)
database/sql/sql.go:799 +0x90
database/sql.(*DB).Ping(...)
database/sql/sql.go:817
main.queryMySQL(0x0, 0x0)
command-line-arguments/main.go:20 +0xe9
main.main()
command-line-arguments/main.go:45 +0x3e
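The same [IO wait] pattern can be reproduced without a MySQL server: in the contrived sketch below, a server accepts TCP connections but never replies (standing in for the delaying proxy), so the client goroutine blocks in Read and is reported as IO wait; the ports are arbitrary choices:

package main

import (
    "net"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go http.ListenAndServe("0.0.0.0:9200", nil) // pprof endpoint for taking dumps

    // A silent server standing in for the delaying TCP proxy.
    ln, err := net.Listen("tcp", "127.0.0.1:3307")
    if err != nil {
        panic(err)
    }
    go func() {
        var held []net.Conn // keep accepted connections alive so they are never closed
        for {
            conn, err := ln.Accept()
            if err != nil {
                return
            }
            held = append(held, conn)
        }
    }()

    conn, err := net.Dial("tcp", "127.0.0.1:3307")
    if err != nil {
        panic(err)
    }
    buf := make([]byte, 1024)
    conn.Read(buf) // blocks forever; this goroutine is reported as [IO wait]
}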

But when using the waiting time to diagnose blocking issues, note that the blocking duration is not counted from the very beginning of the process. Taking the following code snippet as an example and dumping the goroutines every minute, the waiting time is only observed after 4 minutes. Besides, if the blocking code is terminated by some cancellation mechanism such as context.WithTimeout and retried later in a new goroutine, the waiting duration is not accumulated (see the sketch after the dumps below), so relying on the waiting time alone is not enough to detect blocking issues. It is better to start with the goroutine state and become familiar with some common patterns.

package main

import (
    "net/http"
    _ "net/http/pprof"
)

//go:noinline
func sendBlock(ch chan<- bool) {
    ch <- true // blocks forever: nobody receives from ch
}

func main() {
    go func() {
        http.ListenAndServe("0.0.0.0:9200", nil)
    }()
    ch := make(chan bool)
    sendBlock(ch)
}
➜ grep -E --with-filename "goroutine [0-9]+ \[chan send" goroutines*
goroutines.2021-12-05.18:59:32.log:goroutine 1 [chan send]:
goroutines.2021-12-05.19:00:32.log:goroutine 1 [chan send]:
goroutines.2021-12-05.19:01:32.log:goroutine 1 [chan send]:
goroutines.2021-12-05.19:02:32.log:goroutine 1 [chan send]:
goroutines.2021-12-05.19:03:32.log:goroutine 1 [chan send, 2 minutes]:
goroutines.2021-12-05.19:04:32.log:goroutine 1 [chan send, 3 minutes]:
goroutines.2021-12-05.19:05:32.log:goroutine 1 [chan send, 4 minutes]:
goroutines.2021-12-05.19:06:32.log:goroutine 1 [chan send, 5 minutes]:
goroutine 1 [chan send, 11 minutes]:
main.sendBlock(0xc000110240)
command-line-arguments/select_on_chan_block.go:10 +0x37
main.main()
command-line-arguments/select_on_chan_block.go:18 +0x5c
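To illustrate the second caveat, the following contrived sketch cancels a stuck operation after 30 seconds and retries it in a fresh goroutine (blockingOp and the timeout value are made up for illustration). Because every attempt blocks in a newly created goroutine and is unblocked by the timeout, no goroutine in the dump ever shows a waiting time longer than the timeout, even if the program has been retrying the same stuck operation for hours:

package main

import (
    "context"
    "net/http"
    _ "net/http/pprof"
    "time"
)

// blockingOp simulates a call that hangs until its context is cancelled,
// standing in for a stuck database ping or network read.
func blockingOp(ctx context.Context) error {
    <-ctx.Done() // parks this goroutine with wait reason "chan receive"
    return ctx.Err()
}

func main() {
    go http.ListenAndServe("0.0.0.0:9200", nil) // pprof endpoint for taking dumps

    for {
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        done := make(chan error, 1)
        go func() { done <- blockingOp(ctx) }() // a fresh goroutine per attempt
        <-done                                  // each retry starts counting its waiting time from zero
        cancel()
    }
}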

Common patterns for diagnosing

To be added.

Reference