I've been following the commits to the Go project for some time now.
Occasionally someone will post a commit with benchmarks showing how much
the commit improves performance along some axis or another. In this commit,
they've increased the performance of division by 7 (a
notoriously tricky number to divide by) by about 40% on machines running the
ARM instruction set. I'm not 100% sure what the commit does, but it switches
around the instructions that get generated when you do long division on ARM in
a way that makes things faster.
Anyway, I wanted to learn how to do benchmarks, and practice making something
faster. Some motivation came along when Uber released their zap logging
library, with a set of benchmarks showing my friend Alan's log15
logging library as the worst performer. So I thought it would be a good
candidate for benchmarking.
Fortunately Alan has already included a set of benchmarks
in the test suite. You can run them by cloning the
project and then calling the
following:
go test -run=^$ -bench=.
You need to pass -run=^$ to exclude all tests in the test suite; otherwise
all of the tests will run in addition to all of the benchmarks. We only care
about the benchmarks. The -run flag takes a regular expression, and ^$
matches only the empty string, so -run=^$ filters out every test.
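Since -bench is also a regular expression, you can narrow a run to a single
benchmark the same way; we'll use this later when profiling:
go test -run=^$ -bench=StreamNoCtx
For now, though, let's run the whole suite.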
We get some output!
BenchmarkStreamNoCtx-4                      300000      5735 ns/op
BenchmarkDiscard-4                         2000000       856 ns/op
BenchmarkCallerFileHandler-4               1000000      1980 ns/op
BenchmarkCallerFuncHandler-4               1000000      1864 ns/op
BenchmarkLogfmtNoCtx-4                      500000      3866 ns/op
BenchmarkJsonNoCtx-4                       1000000      1816 ns/op
BenchmarkMultiLevelFilter-4                2000000      1015 ns/op
BenchmarkDescendant1-4                     2000000       857 ns/op
BenchmarkDescendant2-4                     2000000       872 ns/op
BenchmarkDescendant4-4                     2000000      1029 ns/op
BenchmarkDescendant8-4                     1000000      1018 ns/op
BenchmarkLog15AddingFields-4                 50000     29492 ns/op
BenchmarkLog15WithAccumulatedContext-4       50000     33599 ns/op
BenchmarkLog15WithoutFields-4               200000      9417 ns/op
PASS
ok      github.com/inconshreveable/log15    30.083s
The name on the left is the benchmark name. The -4 suffix is the number
of CPUs used for the benchmark. The number in the middle is the number of
iterations: to get a good benchmark, you want to run a thing as many times as
feasible, then divide the total runtime by the number of runs. Otherwise
you run into problems like coordinated omission, weighting extreme outliers
too heavily, or failing to weight them at all.
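For context, a Go benchmark is just a function with a b.N loop: the testing
framework picks b.N, runs the loop, and reports the total time divided by b.N
as the ns/op figure. A minimal, stand-alone example (not one of log15's
benchmarks):

import (
	"strconv"
	"testing"
)

// The framework chooses b.N large enough that the loop runs for a
// stable amount of time (about a second by default).
func BenchmarkItoa(b *testing.B) {
	for i := 0; i < b.N; i++ {
		strconv.Itoa(i)
	}
}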
To get a "good" benchmark, you want to try to isolate the running code from
anything else running on the machine. For example, say you run the tip
benchmarks while a Youtube video is playing, make a change, and then run the
benchmarks while nothing is playing. Video players require a lot of CPU/RAM
to play videos, and all other things being equal, the benchmark is going to
be worse with Youtube running. There are a lot of ways to accidentally bias
the results as well, for example you might get bored with the tip benchmarks
and browse around, then make a change and observe the console intensely to see
the new results. You're biasing the results simply by not switching tabs or
screens!
If you have access to a Linux machine with nothing else running on it, that's
going to be your best bet for benchmarking. Otherwise, shut down as many other
programs as are feasible on your main machine before starting any benchmark
tests.
Running a benchmark multiple times can also be a good way to compensate for
environmental effects. Russ Cox's benchstat program is very
useful for this; it gathers many runs of a benchmark, and tells you whether the
results are statistically significant. Run your benchmark with the -count
flag to run it multiple times in a row:
go test -count=20 -run=^$ -bench=Log15AddingFields | tee -a master-benchmark
Do the same for your change, writing to a different file (change-benchmark),
then run benchstat:
benchstat master-benchmark change-benchmark
You'll get really nice looking output. This is the format the Go core
development team generally uses to report benchmark results.
$ benchstat benchmarks/tip benchmarks/early-time-exit
name                 old time/op  new time/op  delta
StreamNoCtx-4        3.60µs ± 6%  3.17µs ± 1%  -12.02%  (p=0.001 n=8+6)
Discard-4             837ns ± 1%   804ns ± 3%   -3.94%  (p=0.001 n=7+6)
CallerFileHandler-4  1.94µs ± 2%  1.88µs ± 0%   -3.01%  (p=0.003 n=7+5)
CallerFuncHandler-4  1.72µs ± 3%  1.65µs ± 1%   -3.98%  (p=0.001 n=7+6)
LogfmtNoCtx-4        2.39µs ± 2%  2.04µs ± 1%  -14.69%  (p=0.001 n=8+6)
JsonNoCtx-4          1.68µs ± 1%  1.66µs ± 0%   -1.08%  (p=0.001 n=7+6)
MultiLevelFilter-4    880ns ± 2%   832ns ± 0%   -5.44%  (p=0.001 n=7+6)
Descendant1-4         855ns ± 3%   818ns ± 1%   -4.28%  (p=0.000 n=8+6)
Descendant2-4         872ns ± 3%   830ns ± 2%   -4.87%  (p=0.001 n=7+6)
Descendant4-4         934ns ± 1%   893ns ± 1%   -4.41%  (p=0.001 n=7+6)
Descendant8-4         990ns ± 2%   958ns ± 2%   -3.29%  (p=0.002 n=7+6)
OK! So now we have a framework for measuring whether a change helps or hurts.
How can I make my code faster?
Good question! There's no one answer for this. In general, there are three
strategies:
- Figure out a way to do less work than you did before. Avoid doing an
expensive computation where it's not necessary, exit early from functions
(sketched below), &c.
- Do the same work, but in a faster way: use a different algorithm, or a
different function, that's faster.
- Do the same work, but parallelize it across multiple CPUs or threads.
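As a quick sketch of the first strategy (with illustrative types and names,
not log15's actual API): a logger can exit early when a message would be
filtered out anyway, before doing any formatting work.

import (
	"fmt"
	"io"
)

type Level int

const (
	DebugLevel Level = iota
	InfoLevel
)

type Logger struct {
	min Level // minimum level this logger will write
	out io.Writer
}

// Debug returns immediately when debug messages are filtered out,
// skipping the relatively expensive Fprintf call entirely.
func (l *Logger) Debug(format string, args ...interface{}) {
	if l.min > DebugLevel {
		return // early exit: nobody will see this message
	}
	fmt.Fprintf(l.out, format+"\n", args...)
}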
Each technique is useful in different places, and it can be hard to predict
where you'll be able to extract performance improvements. It helps to know
how expensive various operations are, so you can evaluate the cost of what a
given piece of code is doing.
It's also easy to spend a lot of time "optimizing" something only to realize
it's not the bottleneck in your program. If you optimize something that takes
5% of the code's execution time, the best overall speedup you can get is 5%,
even if you get the code to run instantaneously. Go's test framework has tools
for figuring out where your code is spending the majority of its time. To get
the best use out of them, focus on profiling the code execution for a single
benchmark. Here, I'm profiling the StreamNoCtx benchmark in the
log15 library.
$ go test -cpuprofile=cpu.out -benchmem -memprofile=mem.out -run=^$ -bench=StreamNoCtx -v
BenchmarkStreamNoCtx-4    500000    3502 ns/op    440 B/op    12 allocs/op
This will generate three files: cpu.out, mem.out, and log15.test. These are
binary files, and you want to use the pprof tool to evaluate them. First
let's look at the CPU profile; I've started pprof and then run top5 to get
the top five functions.
$ go tool pprof log15.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top5
560ms of 1540ms total (36.36%)
Showing top 5 nodes out of 112 (cum >= 60ms)
      flat  flat%   sum%        cum   cum%
     180ms 11.69% 11.69%      400ms 25.97%  runtime.mallocgc
     160ms 10.39% 22.08%      160ms 10.39%  runtime.mach_semaphore_signal
     100ms  6.49% 28.57%      260ms 16.88%  github.com/inconshreveable/log15.escapeString
      60ms  3.90% 32.47%       70ms  4.55%  bytes.(*Buffer).WriteByte
      60ms  3.90% 36.36%       60ms  3.90%  runtime.stringiter2
The top 5 functions are responsible for 36% of the program's runtime. flat is
how much time is spent inside a function itself; cum shows how much time is
spent in a function plus any code it calls. Of these 5, runtime.stringiter2,
runtime.mallocgc, and runtime.mach_semaphore_signal are not good candidates
for optimization. They are very specialized pieces of code, and they're part
of the Go runtime, so changes need to pass all tests and get approved by the
Go core development team. We could potentially figure out how to call them
less often, though: mallocgc showing up so high indicates we are creating
lots of objects. Maybe we could figure out how to create fewer objects.
The likeliest candidate for improvement is the escapeString function in our
own codebase. pprof's list command is super useful for checking this.
(pprof) list escapeString
ROUTINE ======================== github.com/inconshreveable/log15.escapeString in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
      30ms      330ms (flat, cum) 23.40% of Total
      10ms       10ms    225:func escapeString(s string) string {
         .          .    226:	needQuotes := false
         .       90ms    227:	e := bytes.Buffer{}
         .          .    228:	e.WriteByte('"')
      10ms       50ms    229:	for _, r := range s {
         .          .    230:		if r <= ' ' || r == '=' || r == '"' {
         .          .    231:			needQuotes = true
         .          .    232:		}
         .          .    233:
         .          .    234:		switch r {
         .          .    235:		case '\\', '"':
         .          .    236:			e.WriteByte('\\')
         .          .    237:			e.WriteByte(byte(r))
         .          .    238:		case '\n':
         .          .    239:			e.WriteByte('\\')
         .          .    240:			e.WriteByte('n')
         .          .    241:		case '\r':
         .          .    242:			e.WriteByte('\\')
         .          .    243:			e.WriteByte('r')
         .          .    244:		case '\t':
         .          .    245:			e.WriteByte('\\')
         .          .    246:			e.WriteByte('t')
         .          .    247:		default:
         .      100ms    248:			e.WriteRune(r)
         .          .    249:		}
         .          .    250:	}
         .       10ms    251:	e.WriteByte('"')
         .          .    252:	start, stop := 0, e.Len()
         .          .    253:	if !needQuotes {
         .          .    254:		start, stop = 1, stop-1
         .          .    255:	}
      10ms       70ms    256:	return string(e.Bytes()[start:stop])
The basic idea here is to write out the string, adding a backslash before
backslashes, double quotes, newlines, carriage returns, and tabs, and
wrapping the result in quotes if it needs them. Some ideas for improvement:
- We create a bytes.Buffer{} at the beginning of the function. We could keep
a Pool of buffers and retrieve one from it, so we don't need to allocate
memory for a new buffer each time we escape a string (sketched after this
list).
- If a string doesn't contain a double quote, newline, tab, or carriage
return, it doesn't need to be escaped. Maybe we can avoid creating the
Buffer entirely for that case, if we can find a fast way to check whether a
string has those characters in it.
- If we call WriteByte twice in a row, we could probably replace it with a
WriteString(), which will use a copy to move two bytes, instead of growing
the array twice.
- We call e.Bytes() and then cast the result to a string. Maybe we can
figure out how to call e.String() directly.
You can then look at how much time is being spent in each area to get an idea
of how much any given idea will help your benchmarks. For example, replacing
WriteByte with WriteString probably won't save much time: at most you save
the time it takes to write each quote and newline, and most strings are made
up of letters, numbers, and spaces instead. (It doesn't show up at all in the
profile, but that's because the benchmark writes the phrase "test message"
over and over again, and that doesn't have any escapable characters.)
That's the CPU profile: how much time the CPU spends running each function in
the codebase. There's also the memory profile: how much memory each function
allocates. Let's look at that! We have to pass one of these flags to pprof
to get it to show memory information.
Sample value selection option (for heap profiles):
  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts
Let's pass one. Notice that in this case, the top 5 functions are responsible
for 96% of allocated objects:
$ go tool pprof -alloc_objects log15.test mem.out
(pprof) top5
6612331 of 6896359 total (95.88%)
Showing top 5 nodes out of 18 (cum >= 376843)
      flat  flat%   sum%        cum   cum%
   2146370 31.12% 31.12%    6612331 95.88%  github.com/inconshreveable/log15.LogfmtFormat.func1
   1631482 23.66% 54.78%    1631482 23.66%  github.com/inconshreveable/log15.escapeString
   1540119 22.33% 77.11%    4465961 64.76%  github.com/inconshreveable/log15.logfmt
    917517 13.30% 90.42%    1294360 18.77%  github.com/inconshreveable/log15.formatShared
    376843  5.46% 95.88%     376843  5.46%  time.Time.Format
Let's look at one of those functions with the list command:
ROUTINE ======================== github.com/inconshreveable/log15.logfmt in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
   1540119    4465961 (flat, cum) 64.76% of Total
         .          .     97:		if i != 0 {
         .          .     98:			buf.WriteByte(' ')
         .          .     99:		}
         .          .    100:
         .          .    101:		k, ok := ctx[i].(string)
         .    2925842    102:		v := formatLogfmtValue(ctx[i+1])
         .          .    103:		if !ok {
         .          .    104:			k, v = errorKey, formatLogfmtValue(k)
         .          .    105:		}
         .          .    106:
         .          .    107:		// XXX: we should probably check that all of your key bytes aren't invalid
         .          .    108:		if color > 0 {
         .          .    109:			fmt.Fprintf(buf, "\x1b[%dm%s\x1b[0m=%s", color, k, v)
         .          .    110:		} else {
   1540119    1540119    111:			fmt.Fprintf(buf, "%s=%s", k, v)
         .          .    112:		}
         .          .    113:	}
In the common case on line 111 (when color is 0), we're calling fmt.Fprintf
to write the key, then an equals sign, then the value. Fprintf also has to
allocate memory for its own byte buffer, then parse the format string, then
interpolate the two variables. It might be faster, and avoid allocations,
to just call buf.WriteString(k), write the equals sign, then call
buf.WriteString(v). But you'd want to benchmark it first to double check!
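Since you'd want to benchmark it first, here's a quick, illustrative
micro-benchmark comparing the two approaches with a hard-coded key and value:

import (
	"bytes"
	"fmt"
	"testing"
)

// The current approach: Fprintf parses the format string on every call.
func BenchmarkFprintf(b *testing.B) {
	buf := &bytes.Buffer{}
	for i := 0; i < b.N; i++ {
		buf.Reset()
		fmt.Fprintf(buf, "%s=%s", "key", "value")
	}
}

// The proposed approach: write the three pieces directly.
func BenchmarkWriteString(b *testing.B) {
	buf := &bytes.Buffer{}
	for i := 0; i < b.N; i++ {
		buf.Reset()
		buf.WriteString("key")
		buf.WriteByte('=')
		buf.WriteString("value")
	}
}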
Conclusion
Using a combination of these techniques, I was able to improve the
performance of log15 by about 30% for some common code paths, and reduce
memory allocations as well. I wasn't expecting it, but I also found a way to
speed up JSON encoding in the Go standard library by about 20%!
Want someone to benchmark/improve performance in your company's application,
or teach your team more about benchmarking? I am available for
consulting; email me and let's set something up!