The Go runtime provides a convenient way to dump the stack traces of all current goroutines via pprof, which can save developers a lot of time when diagnosing problems such as deadlocks, paused goroutines (IO wait, blocking channel receive, etc.), and goroutine leaks. A long-running Go process may contain thousands of goroutines, and it takes time to find the suspicious stack trace among such a large amount of output. This article summarizes some common methods to quickly discover the suspicious stack trace or goroutine.
When dumping goroutines via
/debug/pprof/goroutine?debug=1, the goroutine information is grouped by stack trace signature: goroutines with the same stack trace are listed in the same group, for example
3 @ 0x43b705 0x46f298 0x46f26e 0x47c8b9 0xa428fd 0x473521
debug=1 dumps omit some critical information, including the goroutine state, the parameter signature of each function call, and the goroutine ID. The
debug=2 mode carries more information, but the stack traces are listed one by one. Given a sample
debug=2 mode goroutine dump containing 4724 goroutines, it is convenient to use goroutine-inspect for the analysis; for example, loading the sample dump file groups the goroutines by state, as follows:
# of goroutines: 4724
goroutine-inspect provides many ways to analyze goroutines, such as using
dedup to fold goroutines with the same stack trace; this grouping works the same way as
debug=1 in pprof. It also supports querying goroutines via govaluate expression evaluation; the supported query patterns can be found in Properties of a Goroutine Dump Item.
A stack trace contains not only the call stack of packages and functions, but also the function parameter signatures, including memory addresses and structure information (such as length or capacity). A more detailed analysis can be found in the article Stack Traces In Go. Apart from the stack trace and the parameter signatures, the goroutine state is the most important part for locating stuck issues, such as network IO blocking, blocked channel receives, etc.
In fact, the state listed in a pprof goroutine dump combines the runtime G status with the wait reason of a blocked G. The G status is defined in the Go runtime with the following states:
- idle: goroutine was just allocated and has not yet been initialized.
- runnable: goroutine is on a run queue. It is not currently executing user code. The stack is not owned.
- running: goroutine may execute user code. The stack is owned by this goroutine. It is not on a run queue. It is assigned an M and a P (g.m and g.m.p are valid).
- syscall: goroutine is executing a system call. It is not executing user code. The stack is owned by this goroutine. It is not on a run queue. It is assigned an M.
- waiting: goroutine is blocked in the runtime.
- moribund_unused: currently unused, but hardcoded in gdb scripts.
- dead: goroutine is currently unused. It may be just exited, on a free list, or just being initialized.
- enqueue_unused: currently unused.
- copystack: goroutine’s stack is being moved.
- preempted: goroutine stopped itself for a suspendG preemption.
The wait reasons are the reasons a goroutine is in the waiting status, and they are defined in the Go runtime too. The most common patterns we can observe in a goroutine dump are the G statuses (including
syscall) and many kinds of wait reasons, and the wait reasons are often the key information for locating problems.
When a goroutine has been blocked for a long time, the goroutine dump will record the blocking duration, as in the first line
goroutine 1 [IO wait, 11 minutes]: of the following goroutine dump. This stack trace is triggered by a MySQL database ping operation through a TCP proxy, with a delay added to the proxy's response to the program, which can be treated as a network IO blocking case.
When using the waiting time to diagnose blocking issues, however, note that the blocking duration is not counted from the very beginning of the process. Taking the following code snippet as an example and dumping goroutines every minute, we find that the waiting time only shows up after about 4 minutes. Besides, if the blocking code is terminated by some cancellation mechanism such as
context.WithTimeout and retried later in a new goroutine, the waiting duration won't accumulate, so relying on the waiting time alone is not enough to detect blocking issues. It is better to start with the goroutine state and become familiar with some common patterns.
➜ grep -E --with-filename "goroutine [0-9]+ \[chan send" goroutines*
goroutine 1 [chan send, 11 minutes]:
To be added.
- official Go documentation about pprof
- goroutine-inspect, an interactive tool to analyze Golang goroutine dumps
- Stack Traces In Go, which explains the information a stack trace provides, including how to identify the value of each parameter passed into each function
- go-profiler-notes, a summary of the Go profilers