Posts Tagged With: Code

Real Life Go Benchmarking

Posted on July 2, 2016

I've been following the commits to the Go project for some time now. Occasionally someone will post a commit with benchmarks showing how much the commit improves performance along some axis or another. In this commit, they've increased the performance of division by 7 (a notoriously tricky number to divide by) by about 40% on machines running the ARM instruction set. I'm not 100% sure what the commit does, but it switches around the instructions that get generated when you do long division on ARM in a way that makes things faster.

Anyway, I wanted to learn how to do benchmarks, and practice making something faster. Some motivation came along when Uber released their go-zap logging library, with a set of benchmarks showing my friend Alan's logging library as the worst performer. So I thought it would be a good candidate for benchmarking.

Fortunately Alan has already included a set of benchmarks in the test suite. You can run them by cloning the project and then calling the following:

go test -run=^$ -bench=.

You need to pass -run=^$ to exclude all tests in the test suite, otherwise all of the tests will run and also all of the benchmarks. We only care about the benchmarks, so -run=^$ filters out every argument.

We get some output!

BenchmarkStreamNoCtx-4                   	  300000	      5735 ns/op
BenchmarkDiscard-4                       	 2000000	       856 ns/op
BenchmarkCallerFileHandler-4             	 1000000	      1980 ns/op
BenchmarkCallerFuncHandler-4             	 1000000	      1864 ns/op
BenchmarkLogfmtNoCtx-4                   	  500000	      3866 ns/op
BenchmarkJsonNoCtx-4                     	 1000000	      1816 ns/op
BenchmarkMultiLevelFilter-4              	 2000000	      1015 ns/op
BenchmarkDescendant1-4                   	 2000000	       857 ns/op
BenchmarkDescendant2-4                   	 2000000	       872 ns/op
BenchmarkDescendant4-4                   	 2000000	      1029 ns/op
BenchmarkDescendant8-4                   	 1000000	      1018 ns/op
BenchmarkLog15AddingFields-4             	   50000	     29492 ns/op
BenchmarkLog15WithAccumulatedContext-4   	   50000	     33599 ns/op
BenchmarkLog15WithoutFields-4            	  200000	      9417 ns/op
PASS
ok  	github.com/inconshreveable/log15	30.083s

The name on the left is the benchmark name. The number (4) is the number of CPU's used for the benchmark. The number in the middle is the number of test runs; to get a good benchmark, you want to run a thing as many times as feasible, then divide the total runtime by the number of test runs. Otherwise you run into problems like coordinated omission and weighting the extreme outliers too much, or failing to weight them at all.

To get a "good" benchmark, you want to try to isolate the running code from anything else running on the machine. For example, say you run the tip benchmarks while a Youtube video is playing, make a change, and then run the benchmarks while nothing is playing. Video players require a lot of CPU/RAM to play videos, and all other things being equal, the benchmark is going to be worse with Youtube running. There are a lot of ways to accidentally bias the results as well, for example you might get bored with the tip benchmarks and browse around, then make a change and observe the console intensely to see the new results. You're biasing the results simply by not switching tabs or screens!

If you have access to a Linux machine with nothing else running on it, that's going to be your best bet for benchmarking. Otherwise, shut down as many other programs as are feasible on your main machine before starting any benchmark tests.

Running a benchmark multiple times can also be a good way to compensate for environmental effects. Russ Cox's benchstat program is very useful for this; it gathers many runs of a benchmark, and tells you whether the results are statistically significant. Run your benchmark with the -count flag to run it multiple times in a row:

go test -count=20 -run=^$ -bench=Log15AddingFields | tee -a master-benchmark

Do the same for your change, writing to a different file (change-benchmark), then run benchstat:

benchstat master-benchmark change-benchmark

You'll get really nice looking output. This is generally the output used by the Go core development team to print benchmark results.

$ benchstat benchmarks/tip benchmarks/early-time-exit
name                 old time/op  new time/op  delta
StreamNoCtx-4        3.60µs ± 6%  3.17µs ± 1%  -12.02%  (p=0.001 n=8+6)
Discard-4             837ns ± 1%   804ns ± 3%   -3.94%  (p=0.001 n=7+6)
CallerFileHandler-4  1.94µs ± 2%  1.88µs ± 0%   -3.01%  (p=0.003 n=7+5)
CallerFuncHandler-4  1.72µs ± 3%  1.65µs ± 1%   -3.98%  (p=0.001 n=7+6)
LogfmtNoCtx-4        2.39µs ± 2%  2.04µs ± 1%  -14.69%  (p=0.001 n=8+6)
JsonNoCtx-4          1.68µs ± 1%  1.66µs ± 0%   -1.08%  (p=0.001 n=7+6)
MultiLevelFilter-4    880ns ± 2%   832ns ± 0%   -5.44%  (p=0.001 n=7+6)
Descendant1-4         855ns ± 3%   818ns ± 1%   -4.28%  (p=0.000 n=8+6)
Descendant2-4         872ns ± 3%   830ns ± 2%   -4.87%  (p=0.001 n=7+6)
Descendant4-4         934ns ± 1%   893ns ± 1%   -4.41%  (p=0.001 n=7+6)
Descendant8-4         990ns ± 2%   958ns ± 2%   -3.29%  (p=0.002 n=7+6)

OK! So now we have a framework for measuring whether a change helps or hurts.

How can I make my code faster?

Good question! There's no one answer for this. In general, there are three strategies:

Figure out a way to do less work than you did before. Avoid doing an expensive computation where it's not necessary, exit early from functions, &c.
Do the same work, but in a faster way; use a different algorithm, or use a different function, that's faster.
Do the same work, but parallelize it across multiple CPU's, or threads.

Each technique is useful in different places, and it can be hard to predict where you'll be able to extract performance improvements. It is useful to know how expensive various operations are, so you can evaluate the costs of a given operation.

It's also easy to spend a lot of time "optimizing" something only to realize it's not the bottleneck in your program. If you optimize something that takes 5% of the code's execution time, the best overall speedup you can get is 5%, even if you get the code to run instantaneously. Go's test framework has tools for figuring out where your code is spending the majority of its time. To get the best use out of them, focus on profiling the code execution for a single benchmark. Here, I'm profiling the StreamNoCtx benchmark in the log15 library.

$ go test -cpuprofile=cpu.out -benchmem -memprofile=mem.out -run=^$ -bench=StreamNoCtx -v
BenchmarkStreamNoCtx-4   	  500000	      3502 ns/op	     440 B/op	      12 allocs/op

This will generate 3 files: cpu.out, mem.out, and log15.test. These are binary files. You want to use the pprof tool to evaluate them. First let's look at the CPU profile; I've started it and then run top10 to get the top 10 functions.

$ go tool pprof log15.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top5
560ms of 1540ms total (36.36%)
Showing top 5 nodes out of 112 (cum >= 60ms)
      flat  flat%   sum%        cum   cum%
     180ms 11.69% 11.69%      400ms 25.97%  runtime.mallocgc
     160ms 10.39% 22.08%      160ms 10.39%  runtime.mach_semaphore_signal
     100ms  6.49% 28.57%      260ms 16.88%  github.com/inconshreveable/log15.escapeString
      60ms  3.90% 32.47%       70ms  4.55%  bytes.(*Buffer).WriteByte
      60ms  3.90% 36.36%       60ms  3.90%  runtime.stringiter2

The top 5 functions are responsible for 36% of the program's runtime. flat is how much time is spent inside of a function, cum shows how much time is spent in a function, and also in any code called by a function. Of these 5, runtime.stringiter2, runtime.mallocgc, and runtime.mach_semaphore_signal are not good candidates for optimization. They are very specialized pieces of code, and they're part of the Go runtime, so changes need to pass all tests and get approved by the Go core development team. We could potentially figure out how to call them less often though - mallocgc indicates we are creating lots of objects. Maybe we could figure out how to create fewer objects.

The likeliest candidate for improvement is the escapeString function in our own codebase. The list function is super useful for checking this.

(pprof) list escapeString
ROUTINE ======================== github.com/inconshreveable/log15.escapeString in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
      30ms      330ms (flat, cum) 23.40% of Total
      10ms       10ms    225:func escapeString(s string) string {
         .          .    226:	needQuotes := false
         .       90ms    227:	e := bytes.Buffer{}
         .          .    228:	e.WriteByte('"')
      10ms       50ms    229:	for _, r := range s {
         .          .    230:		if r <= ' ' || r == '=' || r == '"' {
         .          .    231:			needQuotes = true
         .          .    232:		}
         .          .    233:
         .          .    234:		switch r {
         .          .    235:		case '\\', '"':
         .          .    236:			e.WriteByte('\\')
         .          .    237:			e.WriteByte(byte(r))
         .          .    238:		case '\n':
         .          .    239:			e.WriteByte('\\')
         .          .    240:			e.WriteByte('n')
         .          .    241:		case '\r':
         .          .    242:			e.WriteByte('\\')
         .          .    243:			e.WriteByte('r')
         .          .    244:		case '\t':
         .          .    245:			e.WriteByte('\\')
         .          .    246:			e.WriteByte('t')
         .          .    247:		default:
         .      100ms    248:			e.WriteRune(r)
         .          .    249:		}
         .          .    250:	}
         .       10ms    251:	e.WriteByte('"')
         .          .    252:	start, stop := 0, e.Len()
         .          .    253:	if !needQuotes {
         .          .    254:		start, stop = 1, stop-1
         .          .    255:	}
      10ms       70ms    256:	return string(e.Bytes()[start:stop])

The basic idea here is to write a string, but add a backslash before double quotes, newlines, and tab characters. Some ideas for improvement:

We create a bytes.Buffer{} at the beginning of the function. We could keep a Pool of buffers, and retrieve one, so we don't need to allocate memory for a buffer each time we escape a string.
If a string doesn't contain a double quote, newline, tab, or carriage return, it doesn't need to be escaped. Maybe we can avoid creating the Buffer entirely for that case, if we can find a fast way to check whether a string has those characters in it.
If we call WriteByte twice in a row, we could probably replace it with a WriteString(), which will use a copy to move two bytes, instead of growing the array twice.
We call e.Bytes() and then cast the result to a string. Maybe we can figure out how to call e.String() directly.

You can then look at how much time is being spent in each area to get an idea of how much any given idea will help your benchmarks. For example, replacing WriteByte with WriteString probably won't save much time; you're at most saving the time it takes to write every quote and newline, and most strings are made up of letters and numbers and space characters instead. (It doesn't show up at all in the benchmark but that's because the benchmark writes the phrase "test message" over and over again, and that doesn't have any escape-able characters).

That's the CPU benchmark; how much time the CPU spends running each function in the codebase. There's also the memory profile; how much memory each function allocates. Let's look at that! We have to pass one of these flags to pprof to get it to show memory information.

Sample value selection option (for heap profiles):
  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts

Let's pass one. Notice in this case, the top 5 functions allocate 96% of the objects:

go tool pprof -alloc_objects log15.test mem.out
(pprof) top5
6612331 of 6896359 total (95.88%)
Showing top 5 nodes out of 18 (cum >= 376843)
      flat  flat%   sum%        cum   cum%
   2146370 31.12% 31.12%    6612331 95.88%  github.com/inconshreveable/log15.LogfmtFormat.func1
   1631482 23.66% 54.78%    1631482 23.66%  github.com/inconshreveable/log15.escapeString
   1540119 22.33% 77.11%    4465961 64.76%  github.com/inconshreveable/log15.logfmt
    917517 13.30% 90.42%    1294360 18.77%  github.com/inconshreveable/log15.formatShared
    376843  5.46% 95.88%     376843  5.46%  time.Time.Format

Let's look at a function:

ROUTINE ======================== github.com/inconshreveable/log15.logfmt in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
   1540119    4465961 (flat, cum) 64.76% of Total
         .          .     97:		if i != 0 {
         .          .     98:			buf.WriteByte(' ')
         .          .     99:		}
         .          .    100:
         .          .    101:		k, ok := ctx[i].(string)
         .    2925842    102:		v := formatLogfmtValue(ctx[i+1])
         .          .    103:		if !ok {
         .          .    104:			k, v = errorKey, formatLogfmtValue(k)
         .          .    105:		}
         .          .    106:
         .          .    107:		// XXX: we should probably check that all of your key bytes aren't invalid
         .          .    108:		if color > 0 {
         .          .    109:			fmt.Fprintf(buf, "\x1b[%dm%s\x1b[0m=%s", color, k, v)
         .          .    110:		} else {
   1540119    1540119    111:			fmt.Fprintf(buf, "%s=%s", k, v)
         .          .    112:		}
         .          .    113:	}

In the common case on line 111 (when color = 0), we're calling fmt.Fprintf to write the key, then an equals sign, then the value. Fprintf also has to allocate memory for its own byte buffer, then parse the format string, then interpolate the two variables. It might be faster, and avoid allocations, to just call buf.WriteString(k), then write the equals sign, then call buf.WriteString(v). But you'd want to benchmark it first to double check!

Conclusion

Using a combination of these techniques, I was able to improve the performance of log15 by about 30% for some common code paths, and reduce memory allocations as well. I was not expecting this, but I also found a way to speed up JSON encoding in the Go standard library by about 20%!

Want someone to benchmark/improve performance in your company's application, or teach your team more about benchmarking? I am available for consulting; email me and let's set something up!

Liked what you read? I am available for hire.

Cleaning up Parallel Tests in Go 1.7

Posted on June 15, 2016

I have a lot of tests in Go that integrate with Postgres, and test the interactions between Go models and the database.

A lot of these tests can run in parallel. For example, any test that attempts to write a record, but fails with a constraint failure, can run in parallel with all other tests. A test that tries to read a random database ID and expects to not fetch a record can run in parallel with other tests. If you write your tests so they all use random UUID's, or all run inside of transactions, you can run them in parallel. You can use this technique to keep your test suite pretty fast, even if each individual test takes 20-40 milliseconds.

You can mark a test to run in parallel by calling t.Parallel() at the top of the test. Here's an example test from the job queue Rickover:

func TestCreateMissingFields(t *testing.T) {
  t.Parallel()
  test.SetUp(t)
  job := models.Job{
    Name: "email-signup",
  }
  _, err := jobs.Create(job)
  test.AssertError(t, err, "")
  test.AssertEquals(t, err.Error(), "Invalid delivery_strategy: \"\"")
}

This test will run in parallel with other tests marked Parallel and only with other tests marked Parallel; all other tests run sequentially.

The problem comes when you want to clear the database. If you have a t.Parallel() test clean up after it has made its assertions, it might try to clear the database while another Parallel() test is still running! That wouldn't be good at all. Presumably, the sequential tests are expecting the database to be cleared. (They could clear it at the start of the test, but this might lead to unnecessary extra DB writes; it's better for tests that alter the database to clean up after themselves).

(You can also run every test in a transaction, and roll it back at the end. Which is great, and gives you automatic isolation! But you have to pass a *sql.Tx around everywhere, and make two additional roundtrips to the database, which you probably also need to do in your application).

Go 1.7 adds the ability to nest tests. Which means we can run setup once, run every parallel test, then tear down once. Something like this (from the docs):

func TestTeardownParallel(t *testing.T) {
  // This Run will not return until the parallel tests finish.
  t.Run("group", func(t *testing.T) {
    t.Run("Test1", parallelTest1)
    t.Run("Test2", parallelTest2)
    t.Run("Test3", parallelTest3)
  })
  // <tear-down code>
}

Note you have to lowercase the function names for the parallel tests, or they'll run inside of the test block, and then again, individually. I settled on this pattern:

var parallelTests = []func(*testing.T){
  testCreate,
  testCreateEmptyPriority,
  testUniqueFailure,
  testGet,
}
func TestAll(t *testing.T) {
  test.SetUp(t)
  defer test.TearDown(t)
  t.Run("Parallel", func(t *testing.T) {
    for _, parallelTest := range parallelTests {
      test.F(t, parallelTest)
    }
  })
}

The test mentioned there is the set of test helpers from the Let's Encrypt project, plus some of my own. test.F finds the defined function name, capitalizes it, and passes the result to test.Run:

// capitalize the first letter in the string
func capitalize(s string) string {
  r, size := utf8.DecodeRuneInString(s)
  return fmt.Sprintf("%c", unicode.ToTitle(r)) + s[size:]
}
func F(t *testing.T, f func(*testing.T)) {
  longfuncname := runtime.FuncForPC(reflect.ValueOf(f).Pointer()).Name()
  funcnameparts := strings.Split(longfuncname, ".")
  funcname := funcnameparts[len(funcnameparts)-1]
  t.Run(capitalize(funcname), f)
}

The result is a set of parallel tests that run a cleanup action exactly once. The downside is the resulting tests have two levels of nesting; you have to define a second t.Run that waits for the parallel tests to complete.

=== RUN   TestAll
=== RUN   TestAll/Parallel
=== RUN   TestAll/Parallel/TestCreate
=== RUN   TestAll/Parallel/TestCreateEmptyPriority
=== RUN   TestAll/Parallel/TestUniqueFailure
=== RUN   TestAll/Parallel/TestGet
--- PASS: TestAll (0.03s)
    --- PASS: TestAll/Parallel (0.00s)
        --- PASS: TestAll/Parallel/TestCreate (0.01s)
        --- PASS: TestAll/Parallel/TestCreateEmptyPriority (0.01s)
        --- PASS: TestAll/Parallel/TestUniqueFailure (0.01s)
        --- PASS: TestAll/Parallel/TestGet (0.02s)

The other thing that might trip you up: If you add print statements to your tear down lines, they'll appear in the console output before the PASS lines. However, I verified they run after all of your parallel tests are finished running.

Liked what you read? I am available for hire.

Buying stocks without a time limit

Posted on June 6, 2016

Recently I put the maximum amount of cash into my IRA account. Since stock prices jump up and down all the time, I wondered whether the current price would be the best one to buy the stock at. In particular, I'm not withdrawing money from my IRA for the better part of 40 years, so I'm not going to sell it — I want the absolute lowest price possible.

Stock markets offer a type of order that's useful for this - you can put in a limit order, which says essentially "I'll buy N shares at price X", where X is some price below the last traded price of the stock. Anyone can take the other side of that order at any time, which usually happens when the stock price drops below the limit order! So you can put in a limit order at any time and it will execute whenever the stock price drops below that amount. I'm simplifying slightly but that's the gist.

So if the current stock price is X, what price should I put in my limit order for? You should be able to come up with some kind of model of the stock price and how it moves and say, roughly, "If you set your limit order at Z, there will be a ~98% chance of it executing in the next year." The normal criticism of stock price models is they underestimate the probability of extreme events, which is alright for us, since extreme events make it more likely our limit order will trigger.

I figured someone studied this problem before so I asked about it on Quant Stack Exchange and I got a really good answer! The answer pointed to two different papers, which have attempted to model this.

Essentially the model looks at the volatility of the stock - how often it jumps up and down - the stock's current price, the probability you want the order to hit, and the amount of time you want to wait. The formula is a little complex but it looks like this:

I wrote some code to compute these values for the stocks I wanted to buy. The hardest part is figuring out the volatility, because you need access to a lot of stock prices, and you'll want to compute the standard deviation from that. I was able to get historical price data from Quandl, and current stock prices from Yahoo.

Combine the two and you get output that looks like this:

$ go run erf.go main.go --stock=<stock> --total=5500 --percent 0.95
the standard deviation is: 0.006742
annualized: 0.107022
current price: 74.95
Based on this stock's volatility, you should set a limit order for: $74.35.

Compared with buying it at the current price, you'll be able to buy 0.6 extra shares (a value of $43.77)

Pretty neat! I've put the code up on Github, so you can take a look for yourself.

Liked what you read? I am available for hire.

A Two Month Debugging Story

Posted on May 26, 2016

This is the story about an error that caused our tests to fail maybe one out of one hundred builds. Recently we figured out why this happened and deployed code to fix it.

The Problem

I've been fighting slow test times in Javascript and Sails since pretty much the day I started at Shyp. We run about 6000 tests on a popular CI service, and they take about 7 minutes to run end to end. This is pretty slow, so we spend a lot of time thinking about ways to make them faster.

To ensure each test has a clean slate, we clear the database between each test. Normally you clear tables by running TRUNCATE TABLE table1 table2 table3 table4 CASCADE. Unfortunately this takes about 200ms, which is too slow when you want to run it between every test - it would double the runtime of our test suite.

DELETE FROM runs much faster, but if you have records in table A that reference table B, DELETE FROM tableB will fail, since records in A still point to it. We could draw a dependency graph between our 60 tables and issue the DELETEs in the correct order, but this seems painful to maintain as we add more constraints. Instead our contractor James Cropcho found a solution - disable the constraints, run the DELETE, then re-enable the constraints. This ran in about 20ms.

ALTER TABLE tableA DISABLE TRIGGER ALL;
DELETE FROM tableA;
ALTER TABLE tableA ENABLE TRIGGER ALL;

This worked for a few months. Unfortunately, occasionally a test would time out and fail after 18 seconds - maybe 1 in 100 test runs would fail with a timeout. Normally when the tests fail you get back some kind of error message — a Postgres constraint failure will print a useful message, or a Javascript exception will bubble up, or Javascript will complain about an unhandled rejection.

With this error we didn't get any of those. We also observed it could happen anywhere - it didn't seem to correlate with any individual test or test file.

Where to start looking

For a long time we suspected that our ORM was losing some kind of race and a callback wasn't being hit, deep in the framework code. This type of problem isn't uncommon in Javascript - the most common tool for stubbing the system time causes hangs in many libraries, including Express, Request, async, Bluebird, and Co. These types of issues are incredibly difficult to debug, because when things hang, or fail to hit callbacks, you have virtually no insight into what the stack looks like, or what callback failed to get hit. We could add console.log lines but you have to have some idea what we were looking for. Furthermore annotating every single library and every single query would have caused readability problems elsewhere.

Run the tests forever

Our CI provider lets us SSH into a host, but you have to enable SSH access before the build completes, which isn't practical when 1 in 100 builds fails. My normal approach to debugging intermittent test failures is to run the tests in a continuous loop until they fail, and then inspect the test logs/SSH to the host and look for... something, until you figured out what was going on. Unfortunately that wasn't triggering it often enough either, and even when I did trigger it, our CI provider shuts down SSH access after 30 minutes, and I wasn't fast enough to notice/investigate.

Instead we had to ship whatever logs we needed as artifacts from the test. This required turning on database logging for Postgres, then copying logs to the directory where artifacts get exported. Database logs across multiple containers shared the same filename, which meant only database logs from the first container became available as artifacts, so we also had to add a unique ID to each log file so they'd get exported properly.

This was a lot of steps, and took just enough debugging/setup time that I procrastinated for a while, hoping a better stack trace or error message might reveal itself, or that the problem would go away.

Here are the Postgres settings we enabled:

logging_collector = on
log_statement = 'all'
log_duration = on
# Log if we are waiting on a lock.
deadlock_timeout = 1s
log_lock_waits = on
# Ensure all logs are in one file.
log_rotation_size = 200MB
log_line_prefix = '%m [%p-%c-%l] %e %q%u@%d '

Finally we found something!

20:38:04.947 LOG: process 16877 still waiting for AccessExclusiveLock on relation 16789 of database 16387 after 1000.113 ms
20:38:04.947 DETAIL: Process holding the lock: 16936. Wait queue: 16877.
20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms
20:38:23.141 ERROR: canceling autovacuum task
20:38:23.141 CONTEXT: automatic vacuum of table "circle_test.public.events"
20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms

It looks like, unbeknownst to us, autovacuum was running in the background, and conflicting with the multi-statement ALTER TABLE query, in a way that each one was waiting for the other to complete. If you get deep enough in the weeds of Postgres blog posts, you find that stuck VACUUMs can and do happen.

Although it requires a fairly unusual combination of circumstances, a stuck VACUUM can even lead to an undetected deadlock situation, as Tom Lane recently pointed out. If a process that has a cursor open on a table attempts to take an exclusive lock on that table while it's being vacuumed, the exclusive lock request and the vacuum will both block until one of them is manually cancelled.

Not out of the woods

Unfortunately, we continued to see deadlocks. In some cases, a test would continue to make background database queries after they had reported completion. These would interact poorly with our ALTER TABLE commands, and deadlock, leading to a timeout. I posted on the Postgresql mailing list about this, and Alexey Bashtanov suggested a different approach. Instead of running ALTER TABLE, start a transaction and defer enforcement of foreign key constraints until the COMMIT. That way we could run the DELETE FROM's in any order and they'd still work.

BEGIN; SET CONSTRAINTS ALL DEFERRED;
DELETE FROM tableA;
DELETE FROM tableB;
COMMIT

This approach also sped up that query! It now runs in about 10ms, versus 20ms before.

Not out of the woods, again

We thought we'd nipped it in the bud, and certainly we were happy with the performance improvement. Unfortunately we've still been seeing timeouts; less frequently, but maybe once or twice a week. They usually happen when INSERTing records into an empty table; Postgres just hangs for 15 seconds, until the statement_timeout kicks in and Postgres kills the query, which fails the build. There are no messages about locks, so we're not sure what's going on. This is pretty uncharacteristic behavior for Postgres, and we're not doing anything particularly crazy with queries. We're open to suggestions; I posted about it again on the Postgres mailing list but did not get any good feedback.

Conclusion

The amount of work that's gone into tracking this failure down is depressing, though it's surprising that Javascript isn't the culprit for intermittent test failures. It would be nice if our CI provider had a way to auto-enable SSH on failed builds, enabled Postgres logging by default, or disabled autovacuum by default. I'd also recommend if you're adding foreign key constraints to make them DEFERRABLE INITIALLY IMMEDIATE; this gives you the most flexibility to pursue strategies like we did.

Liked what you read? I am available for hire.

The Frustration and Loneliness of Server-Side Javascript Development

Posted on May 2, 2016

I was hired in December 2014 as the sixth engineer at Shyp. Shyp runs Node.js on the server. It's been a pretty frustrating journey, and I wanted to share some of the experiences we've had. There are no hot takes about the learning curve, or "why are there so many frameworks" in this post.

Initially we were running on Node 0.10.30. Timers would consistently fire 50 milliseconds early - that is, if you called setTimeout with a duration of 200 milliseconds, the timer would fire after 150 milliseconds. I asked about this in the #Node.js Freenode channel, and was told it was my fault for using an "ancient" version of Node (it was 18 months old at that time), and that I should get on a "long term stable" version. This made timeout-based tests difficult to write - every test had to set timeouts longer than 50ms.
I wrote a metrics library that published to Librato. I expect a background metric publishing daemon to silently swallow/log errors. One day Librato had an outage and returned a 502 to all requests; unbeknownst to me the underlying librato library we were using was also an EventEmitter, that crashed the process if unhandled. This caused about 30 minutes of downtime in our API.
The advice on the Node.js website is to crash the process if there's an uncaught exception. Our application is about 60 models, 30 controllers; it doesn't seem particularly large. It consistently takes 40 seconds to boot our server in production; the majority of this time is spent requiring files. Obviously we try not to crash, and fix crashes when we see them, but we can't take 40 seconds of downtime because someone sent an input we weren't expecting. I asked about this on the Node.js mailing list but got mostly peanuts. Recently the Go core team sped up build times by 2x in a month.
We discovered our framework was sleeping for 50ms on POST and PUT requests for no reason. I've previously written about the problems with Sails, so I am going to omit most of that here.
We upgraded to Node 4 (after spending a day dealing with a nasty TLS problem) and observed our servers were consuming 20-30% more RAM, with no accompanying speed, CPU utilization, or throughput increase. We were left wondering what benefit we got from upgrading.
It took about two months to figure out how to generate a npm shrinkwrap file that produced reliable changes when we tried to update it. Often attempting to modify/update it would change the "from" field in every single dependency.
The sinon library appears to be one of the most popular ways to stub the system time. The default method for stubbing (useFakeTimers) leads many other libraries to hang inexplicably. I noticed this after spending 90 minutes wondering why stubbing the system time caused CSV writing to fail. The only ways to debug this are 1) to add console.log statements at deeper and deeper levels, since you can't ever hit ctrl+C and print a stack trace, or 2) take a core dump.
The library we used to send messages to Slack crashed our server if/when Slack returned HTML instead of JSON.
Our Redis provider changed something - we don't know what, since the version number stayed the same - in their Redis instance, which causes the Redis connection library we use to crash the process with a "Unhandled event" message. We have no idea why this happens, but it's tied to the number of connections we have open - we've had to keep a close eye on the number of concurrent Redis connections, and eventually just phased out Redis altogether.
Our database framework doesn't support transactions, so we had to write our own transaction library.
The underlying database driver doesn't have a way to time out/cancel connection acquisition, so threads will wait forever for a connection to become available. I wrote a patch for this; it hasn't gotten any closer to merging in 10 weeks, so I published a new NPM package with the patch, and submitted that upstream.
I wrote a job queue in Go. I couldn't benchmark it using our existing Node server as the downstream server, the Node server was too slow - a basic Express app would take 50ms on average to process incoming requests and respond with a 202, with 100 concurrent in-flight requests. I had to write a separate downstream server for benchmarking.
Last week I noticed our integration servers were frequently running out of memory. Normally they use about 700MB of memory, but we saw they would accumulate 500MB of swap in about 20 seconds. I think the app served about thirty requests total in that time frame. There's nothing unusual about the requests, amount of traffic being sent over HTTP or to the database during this time frame; the amount of data was in kilobytes. We don't have tools like pprof. We can't take a heap dump, because the box is out of memory by that point.

Summary

You could argue - and certainly you could say this about Sails - that we chose bad libraries, and we should have picked better. But part of the problem with the Node ecosystem is it's hard to distinguish good libraries from bad.

I've also listed problems - pound your head, hours or days of frustrating debugging - with at least six different libraries above. At what point do you move beyond "this one library is bad" and start worrying about the bigger picture? Do the libraries have problems because the authors don't know better, or because the language and standard library make it difficult to write good libraries? I don't think it helps.

You could argue that we are a fast growing company and we'd have these problems with any language or framework we chose. Maybe? I really doubt it. The job queue is 6000 lines of Go code written from scratch. There was one really bad problem in about two months of development, and a) the race detector caught it before production, b) I got pointed to the right answer in IRC pretty quickly.

You could also argue we are bad engineers and better engineers wouldn't have these problems. I can't rule this out, but I certainly don't think this is the case.

The Javascript community gets a lot of credit for its work on codes of conduct, on creating awesome conferences, for creating a welcoming, thoughtful community. But the most overwhelming feeling I've gotten over the past year has been loneliness. I constantly feel like I am the only person running into these problems with libraries, these problems in production, or I'm the only person who cares about the consequences of e.g. waiting forever for a connection to become available. Frequently posting on Github or in IRC and getting no response, or getting told "that's the way it is". In the absence of feedback, I am left wondering how everyone else is getting along, or how their users are.

I've learned a lot about debugging Node applications, making builds faster, finding and fixing intermittent test failures and more. I am available to speak or consult at your company or event, about what I have learned. I am also interested in figuring out your solutions to these problems - I will buy you a coffee or a beer and chat about this at any time.

Liked what you read? I am available for hire.

Go Concurrency for Javascript Developers

Posted on April 6, 2016

Often the best way to learn a new language is to implement something you know
with it. Let’s take a look at some common async Javascript patterns, and how
you’d implement them in Go.

Callbacks

You can certainly implement callbacks in Go! Functions are first class
citizens. Here we make a HTTP request and hit the callback with the response
body or an error.

func makeRequest(method string, url string, cb func(error, io.ReadCloser)) {
	req, _ := http.NewRequest(method, url, nil)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		cb(err, nil)
	} else {
		cb(err, resp.Body)
	}
}
func main() {
	makeRequest("GET", "http://ipinfo.io/json", func(err error, body io.ReadCloser) {
		defer body.Close()
		io.Copy(os.Stdout, body)
	})
}

Go uses callbacks in the standard library in some places – time.AfterFunc,
http.HandlerFunc. Other places callbacks are mostly used to allow you to
specify what function or action you want to take, in interaction with a useful
piece of code – filepath.Walk, strings.LastIndexFunc, etc.

Most of the same problems with callbacks apply to Go as Javascript, namely, you
can forget to call a callback, or you can call it more than once. At least you
can’t try to callback that is undefined in Go, since the compiler will not
let you do this.

But callbacks aren’t idiomatic. Javascript needs callbacks because there’s
only a single thread of execution, processing every request on your server
sequentially. If you "wait"/"block" for something to finish (e.g. by calling
fs.readFileSync or any of the Sync methods, every other request on your
server has to wait until you’re done. I’m simplifying, but you have to have
callbacks in Javascript so everything else (other incoming HTTP requests, for
example) have a chance to run.

Go does not have this same limitation. A Javascript thread can only run on one
CPU; Go can use all of the CPU’s on your computer, so multiple threads can
do work at the same time, even if all of them are making "blocking" function
calls, like HTTP requests, database queries, or reading or writing files.

Furthermore, even if you only had one CPU, multiple Go threads would run
without having IO "block" the scheduler, even though methods like http.Get
have a synchronous interface. How does this work? Deep down in code like
net/fd_unix.go, the socket code calls syscall.Write, which (when you follow
it far enough) calls runtime.entersyscall, which signals to the scheduler
that it won’t have anything to do for a while, and other work should
proceed.

For that reason you probably don’t want to use callbacks for asynchronous code.

.then / .catch / .finally

This is the other common approach to async operations in Javascript.
Fortunately most API’s in Go are blocking, so you can just do one thing and
then the other thing. Say you want to execute the following async actions:

var changeEmail = function(userId, newEmail) {
    return checkEmailAgainstBlacklist(newEmail).then(function() {
        return getUser(userId);
    }).then(function(user) {
        user.email = newEmail;
        return user.save();
    }).catch(function(err) {
        console.error(err);
        throw err;
    });
}

In Go, you’d just do each action in turn and wait for them to finish:

func changeEmail(userId string, newEmail string) (User, error) {
    err := checkEmailAgainstBlacklist(newEmail)
    if err != nil {
        return nil, err
    }
    user, err := getUser(userId)
    if err != nil {
        return nil, err
    }
    user.Email = newEmail
    return user.Save()
}

.finally is usually implemented with the defer keyword; you can at any time
defer code to run at the point the function terminates.

resp, err := http.Get("http://ipinfo.io/json")
// Note Close() will panic if err is non-nil, you still have to check err
checkError(err)
defer resp.Body.Close()
io.Copy(os.Stdout, resp.Body)

Promise.try

Just ignore the error response, or handle it; there’s no real difference
between sync and async errors, besides calling panic, which your code
shouldn’t be doing.

Promise.join

Doing multiple things at once in Javascript is pretty easy with Promises.
Here we dispatch two database queries at once, and handle the results in one
function.

var checkUserOwnsPickup = function(userId, pickupId) {
    return Promise.join(
        getUser(userId),
        getPickup(pickupId)
    ).spread(function(user, pickup) {
        return pickup.userId === user.id;
    })
}

Unfortunately doing this in Go is a little more complicated. Let’s start with
a slightly easier case – doing something where we don’t care about the result,
whether it errors or not. You can put each in a goroutine and they’ll execute
in the background.

func subscribeUser(user User) {
    go func() {
        sendWelcomeEmail(user.Email)
    }()
    go func() {
        metrics.Increment('user.signup')
    }()
}

The easiest way to pass the results back is to declare the variables in the
outer scope and assign them in a goroutine.

func main() {
    var user models.User
    var userErr error
    var pickup models.Pickup
    var pickupErr error
    var wg sync.WaitGroup

    wg.Add(2)
    go func() {
        user, userErr = getUser(userId)
        wg.Done()
    }()
    go func() {
        pickup, pickupErr = getPickup(pickupId)
        wg.Done()
    }()
    wg.Wait()
    // Error checking omitted
    fmt.Println(user.Id)
    fmt.Println(pickup.Id)
}

More verbose than Promise.join, but the lack of generics primitives makes a
shorter solution tough. You can gain more control, and early terminate by using
channels, but it’s more verbose to do so.

Promise.map

Do the same as the Go version of Promise.join above, and then write a for
loop to loop over the resolved values. This might be a little trickier because
you need to specify the types of everything, and you can’t create arrays with
differing types. If you want a given concurrency value, there are some neat
code examples in this post on Pipelines.

Checking for races

One of the strengths of Go is the race detector, which detects whether
two threads can attempt to read/write the same variable in different orders
without synchronizing first. It’s slightly slower than normal, but I highly
encourage you to run your test suite with the race detector on. Here’s our
make target:

test:
    go test -race ./... -timeout 2s

Once it found a defect in our code that turned out to be a standard library
error! I highly recommend enabling the race detector.

The equivalent in Javascript would be code that varies the event loop each
callback runs in, and checks that you still get the same results.

Thanks

I hope this helps! I wrote this to be the guide I wish I had when I was getting
started figuring out how to program concurrently in Go.

Thanks to Kyle Conroy, Alan Shreve, Chris
Goddard, @rf, and members of the #go-nuts IRC channel for
answering my silly questions.

Liked what you read? I am available for hire.

Encoding and Decoding JSON, with Go’s net/http package

Posted on February 11, 2016

This is a pretty common task: encode JSON and send it to a server, decode JSON on the server, and vice versa. Amazingly, the existing resources on how to do this aren't very clear. So let's walk through each case, for the following simple User object:

type User struct{
    Id      string
    Balance uint64
}

Sending JSON in the body of a POST/PUT request

Let's start with the trickiest one: the body of a Go's http.Request is an io.Reader, which doesn't fit well if you have a struct - you need to write the struct first and then copy that to a reader.

func main() {
	u := User{Id: "US123", Balance: 8}
	b := new(bytes.Buffer)
	json.NewEncoder(b).Encode(u)
	res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
	io.Copy(os.Stdout, res.Body)
}

Decoding JSON on the server

Let's say you're expecting the client to send JSON data to the server. Easy, decode it with json.NewDecoder(r.Body).Decode(&u). Here's what that looks like with error handling:

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		var u User
		if r.Body == nil {
			http.Error(w, "Please send a request body", 400)
			return
		}
		err := json.NewDecoder(r.Body).Decode(&u)
		if err != nil {
			http.Error(w, err.Error(), 400)
			return
		}
		fmt.Println(u.Id)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Encoding JSON in a server response

Just the opposite of the above - call json.NewEncoder(w).Encode(&u) to write JSON to the server.

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		u := User{Id: "US123", Balance: 8}
		json.NewEncoder(w).Encode(u)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Reading a JSON response from the server.

This time you're going to read the response body in the client, after making the request.

func main() {
	u := User{Id: "US123", Balance: 8}
	b := new(bytes.Buffer)
	json.NewEncoder(b).Encode(u)
	res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
	var body struct {
		// httpbin.org sends back key/value pairs, no map[string][]string
		Headers map[string]string `json:"headers"`
		Origin  string            `json:"origin"`
	}
	json.NewDecoder(res.Body).Decode(&body)
	fmt.Println(body)
}

That's it! I hope it helps you a lot. Note that we only had to encode/decode the response to a byte array one time - in every other case we passed it directly to the reader/writer, which is one of the really nice things about Go, and interfaces - in most cases it's really easy to pass around streams, instead of having to deal with the intermediate steps.

Liked what you read? I am available for hire.

How to push to multiple Github accounts from the same machine

Posted on February 8, 2016

You might want/need to check out repos locally as several different Github users. For example, my Github account is kevinburke, but I push code at work as kevinburkeshyp. If I could only commit/push as one of those users on each machine, or I had to manually specify the SSH key to use, it would be a big hassle!

Generate some SSH keys

First generate SSH keys for each account, and then upload the public keys to the right Github accounts.

ssh-keygen -o -a 100 -t ed25519 -f "$HOME/.ssh/kevinburke_rsa"
ssh-keygen -o -a 100 -t ed25519 -f "$HOME/.ssh/kevinburkeshyp_rsa"

Set up SSH host aliasing

You're going to use your standard SSH config for the host you use more often, and an alias for the one you use less often. Here's my SSH configuration for the host I use more often - on my work laptop, it's my work account.

Host github.com
    IdentityFile ~/.ssh/kevinburkeshyp_rsa
    HostName github.com
    User kevinburkeshyp

Now, set up an alias for your second account.

Host github.com-kevinburke
    IdentityFile ~/.ssh/kevinburke_rsa
    HostName github.com
    User kevinburke

The Host is set to github.com-kevinburke - this is what you write into your SSH config locally. The HostName is what's used for DNS resolution, which is how this still works. Note you'll want to switch the IdentityFile and the User with the key and usernames you set when you generated the keys.

Now if you want to clone something into the second account, use the special host alias:

git clone git@github.com-kevinburke:kevinburke/hamms.git

That's it! It should just work.

Which Commit Name/Email Address?

Most people configure their commit email address using the following command:

$ git config --global user.email test@example.com

This sets your email address in $HOME/.gitconfig and applies that value to every repository/commit on your system. Obviously this is a problem if you have two different email addresses! I want to use burke@shyp.com with my Shyp account and kev@inburke.com with my personal account.

I fix this by putting an empty email in the global $HOME/.gitconfig:

[user]
    name = Kevin Burke
    email = ""

When I try a commit in a new repository, Git shows the following message:

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'kevin@rhino.(none)')

Which is good! It prevents me from committing until I've configured my email address. I then have helpers set up to easily configure this.

alias setinburkeemail="git config user.email 'kev@inburke.com'"

That's it! What tips do you use for managing two different Github accounts?

Liked what you read? I am available for hire.

Just Return One Error

Posted on January 14, 2016

Recently there's been a trend in API's to return more than one error, or to always return an array of errors, to the client, when there's a 400 or a 500 server error. From the JSON API specification:

Error objects MUST be returned as an array keyed by errors in the top level of a JSON API document.

Here's an example.

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/vnd.api+json

{
  "errors": [
    {
      "status": "422",
      "source": { "pointer": "/data/attributes/first-name" },
      "title":  "Invalid Attribute",
      "detail": "First name must contain at least three characters."
    }
  ]
}

I've also seen it in model validation; for example, Waterline returns an array of errors in the invalidAttributes field of the returned error object.

I think this is overkill, and adds significant complexity to your application for little benefit. Let's look at why.

Clients are rarely prepared to show multiple errors. At Shyp, we returned an errors key from the API on a 400 or 500 error for about 18 months. All four of the API's clients - the iOS app, the Android app, our error logger, and our internal dashboard - simply processed this by parsing and displaying errors[0], and ignoring the rest. It makes sense from the client's perspective - the errors were being shown on a phone (not much room) or an alert box in the browser (show two, and you get the "Prevent this page from displaying more alerts" dialog). But if clients can't make use of multiple errors, it makes me wonder why we return them from the API.
You rarely catch every problem with a request. Say a pickup is 1) missing the to address, 2) missing the city, 3) shipping to a country we don't ship to, and 4) contains a photo of a firearm, which we can't ship. If we return an error, it's going to be ['Missing City', 'Missing Address Line 1']. There's no way we're going to submit your invalid address to our rates service and discover even more problems with the customs form, or try to parse the photo and learn that you're trying to ship a firearm. So even if you fix the first errors, we're still going to return you a 400 from the API.
Continuing to process the request is dangerous. Attaching multiple errors to an API response implies that you discovered a problem with an incoming request and continued processing it. This is dangerous! Every part of your system needs to be prepared to handle invalid inputs. This might be a useful defensive measure, but it's also a lot of work for every function in your system to avoid making assumptions about the input that's passed to it. In my experience, passing around bad inputs leads to unnecessary server errors and crashes.
Programming languages and libraries return one error. Generally if you call raise or throw, you're limited to throwing a single Error object. If you're using a third party library that can error, usually library functions are limited to returning one error. Returning a single error from your API or your model validation function aligns with 99% of error handling paradigms that currently exist.

A Caveat

There's one place where returning multiple errors may make sense - a signup form with multiple form fields, where it can be frustrating to only return one problem at a time. There are two ways to deal with this:

Perform validation on the client, and detect/display multiple errors there. It's a better user experience to do this, since feedback is more or less instantaneous. You'll need the validation in your API anyway, but if someone manages to bypass the client checks, it's probably fine to just return one error at a time.
Special case your returned error with a key like attributes, and return the necessary error messages, just for this case. I'd rather special case this error message, than force every other error to be an array just to handle this one place in the API.

Liked what you read? I am available for hire.

Safely Moving a Large Shrinkwrapped Dependency

Posted on December 26, 2015

This past week the team at Shyp decided to fork the framework we use (Sails.js) and the ORM it uses (Waterline). It was an interesting exercise in JS tooling, and I wasted a lot of time on busywork, so I thought I'd share a little bit about the process.

Why Fork?

We've been on an outdated fork of Sails/Waterline for a while. We chose not to update because we were worried that updates would introduce new errors, and many of the updates were new features that we didn't want or need.

The recent ugly discussions between the former CEO of Balderdash and other core contributors to Sails/Waterline made us rethink our dependency on the core tools. We concluded upgrading our framework was going to always be unlikely. We also figured we could remove the many parts of the framework that we don't want or need (sockets, cookies, blueprints, the horribly insecure "action routes", CSRF, Mongo support, Grunt, sails generate, sails www, the multi-query methods that aren't safe against race conditions, &c), and apply bugfixes to parts of the code that we knew were broken. Fortunately the changes already started to pay for themselves at the end of an 8 hour day, as I'll explain later.

Rollout Plan

On its face, the rollout plan was pretty simple.

Fork Sails to Shyp/sails.
Update the HEAD of the master branch to point at the version we have installed in production.
Point the Sails dependency at our fork (no code changes).
Make changes that we want.
Update the dependency.

Breakdowns

This broke down somewhere around (3), for a few reasons.

If I tried to npm install github.com/Shyp/sails.git, every single package Sails depended on, and all of their sub-dependencies, would update. We didn't want this, for several reasons - I don't trust package maintainers to stay on top of updates, or predict when changes/updates will break code, and package updates amount to a large amount of unvetted code being deployed into your API without any code review.

I didn't know how to work around this so I punted on it by hard-coding the update in the npm shrinkwrap file.
My next attempt was to try to generate a shrinkwrap file for Sails, and give our Sails fork the same versions of every dependency & sub-dependency we had in our API. This was the right idea, but failed in its implementation. Sails dependencies are currently alphabetized, but they weren't in the version that we were running. I alphabetized the package.json list while generating a shrinkwrap file, which changed the dependency order in the Shyp/sails shrinkwrap file. This made it impossible to diff against the shrinkwrap file in our API, which listed Sails's dependencies in an arbitrary order.
Finally I figured out what needed to happen. First, alphabetize the Sails dependency list in the Shyp API repository. Once that's done, I could use that shrinkwrap file as the basis for the shrinkwrap file in the Sails fork, so the dependencies in the Sails project matched up with the Shyp dependencies. Specifically, I accomplished this by sorting the dependencies in the package.json file, placing the sorted file in api/node_modules/sails/package.json, and then re-running clingwrap with the same dependencies installed. I then verified the same version of each dependency was installed by sorting the JSON blob in the old and new shrinkwrap files, and ensuring that the diff was empty.

Once that was done, I could update dependencies in the (alphabetized) fork, and the diff in Shyp API's (now alphabetized) npm-shrinkwrap.json would be small. It was great to finally be able to generate a small diff in npm-shrinkwrap.json, and verify that it was correct.

Compounding the frustration was that each build took about 2 minutes, to wipe and reinstall all versions of every dependency. It was a slow afternoon that got more frustrating as it wore on.

Notes

Before adding or attempting to move any large NPM dependency, ensure that the dependencies are in alphabetical order. You'll run into too many headaches if this is not the case, and you won't be able to regenerate the npm-shrinkwrap file or be able to reason about it.
It's surprising that the builtin build tools don't enforce alphabetical order.
NPM's shrinkwrap tool does not generate the same file every time you run it. Specifically, if you install a package, occasionally the shrinkwrap file will report the "from" string is a range, for example, "from": "colors@>=0.6.0-1 <0.7.0", and sometimes it will report the "from" string as a npmjs.com URL. For a while, we worked around this by generating the NPM shrinkwrap file twice - the second time we ran it, it would generate a consistent set of strings.
```
  rm -rf node_modules
  npm cache clear
  npm install --production
  npm shrinkwrap
  npm install --production
  npm shrinkwrap
```
Eventually we switched to using clingwrap, which solves this problem by avoiding the "from" and "resolved" fields entirely. We've also heard NPM 3 contains updates to the shrinkwrap file, but NPM 3 also believes our dependency tree contains errors, so we've been unable to effectively test it. Clingwrap also has some errors with NPM 3 and github repositories.
It was incredibly tempting to just say "fuck it" and deploy the updated version of every dependency and subdependency. Based on the amount of work it took me to figure out how to generate and check in consistent diffs, I came away thinking that there are a lot of Javascript applications out there that are unable to control/stabilize the versions of libraries that they use.
These are all problems you can avoid or mitigate by thinking incredibly hard before taking on a dependency, especially if, like Sails, the tool you're leaning on has a lot of its own dependencies. Is there really an advantage to using _.first(arr) instead of arr[0]? Do you really need the pluralize library, or is it sufficient to just append an 's' for your use case? Do you need to take a dependency to build a state machine, or can you use a dictionary and an UPDATE query?

Benefits

This took a lot of time to get right, and felt like busywork for at least three of the four hours I spent working on this, but we finally got some benefits out of this by the end of the day.

I committed a change to immediately error if anyone attempts to use the count() method, which attempts to pull the entire table into memory. Deploying this in our API application revealed four places where we were calling count()! Fortunately, they were all in test code. Still, it's delightful to be able to automatically enforce a rule we previously had to rely on code review (and shared team memory) to catch.
We removed the Grunt dependency, which we never used. There are about 4500 lines of deletions waiting in pull requests.
We understand shrinkwrap a little better, and hopefully this process won't be so painful next time.
We removed some incorrect Mongo-specific code from Waterline. This is the type of adaptation that we can only do by giving up compatibility, and we're happy with how easy this is now.

So that was my Wednesday last week. As it stands, we're planning on developing this fork in the open. I don't think you should use it - there are still better alternatives - but if you're stuck on Waterline with Postgres, or using Sails as a REST API, you might want to take a look.

Liked what you read? I am available for hire.