This is the story about an error that caused our tests to fail maybe one out of one hundred builds. Recently we figured out why this happened and deployed code to fix it.
To ensure each test has a clean slate, we clear the database between each test.
Normally you clear tables by running
TRUNCATE TABLE table1 table2 table3 table4 CASCADE. Unfortunately this takes about 200ms, which is too slow when
you want to run it between every test - it would double the runtime of our test
DELETE FROM runs much faster, but if you have records in table A that
reference table B,
DELETE FROM tableB will fail, since records in A still
point to it. We could draw a dependency graph between our 60 tables and issue
the DELETEs in the correct order, but this seems painful to maintain as we
add more constraints. Instead our contractor James Cropcho found a solution -
disable the constraints, run the
DELETE, then re-enable the constraints. This
ran in about 20ms.
ALTER TABLE tableA DISABLE TRIGGER ALL; DELETE FROM tableA; ALTER TABLE tableA ENABLE TRIGGER ALL;
With this error we didn't get any of those. We also observed it could happen anywhere - it didn't seem to correlate with any individual test or test file.
Where to start looking
Run the tests forever
Our CI provider lets us SSH into a host, but you have to enable SSH access before the build completes, which isn't practical when 1 in 100 builds fails. My normal approach to debugging intermittent test failures is to run the tests in a continuous loop until they fail, and then inspect the test logs/SSH to the host and look for... something, until you figured out what was going on. Unfortunately that wasn't triggering it often enough either, and even when I did trigger it, our CI provider shuts down SSH access after 30 minutes, and I wasn't fast enough to notice/investigate.
Instead we had to ship whatever logs we needed as artifacts from the test. This required turning on database logging for Postgres, then copying logs to the directory where artifacts get exported. Database logs across multiple containers shared the same filename, which meant only database logs from the first container became available as artifacts, so we also had to add a unique ID to each log file so they'd get exported properly.
This was a lot of steps, and took just enough debugging/setup time that I procrastinated for a while, hoping a better stack trace or error message might reveal itself, or that the problem would go away.
Here are the Postgres settings we enabled:
logging_collector = on log_statement = 'all' log_duration = on # Log if we are waiting on a lock. deadlock_timeout = 1s log_lock_waits = on # Ensure all logs are in one file. log_rotation_size = 200MB log_line_prefix = '%m [%p-%c-%l] %e %q%u@%d '
Finally we found something!
20:38:04.947 LOG: process 16877 still waiting for AccessExclusiveLock on relation 16789 of database 16387 after 1000.113 ms 20:38:04.947 DETAIL: Process holding the lock: 16936. Wait queue: 16877. 20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms 20:38:23.141 ERROR: canceling autovacuum task 20:38:23.141 CONTEXT: automatic vacuum of table "circle_test.public.events" 20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms
It looks like, unbeknownst to us, autovacuum was running in the background, and conflicting with the multi-statement ALTER TABLE query, in a way that each one was waiting for the other to complete. If you get deep enough in the weeds of Postgres blog posts, you find that stuck VACUUMs can and do happen.
Although it requires a fairly unusual combination of circumstances, a stuck VACUUM can even lead to an undetected deadlock situation, as Tom Lane recently pointed out. If a process that has a cursor open on a table attempts to take an exclusive lock on that table while it's being vacuumed, the exclusive lock request and the vacuum will both block until one of them is manually cancelled.
Not out of the woods
Unfortunately, we continued to see deadlocks. In some cases, a test would continue to make background database queries after they had reported completion. These would interact poorly with our ALTER TABLE commands, and deadlock, leading to a timeout. I posted on the Postgresql mailing list about this, and Alexey Bashtanov suggested a different approach. Instead of running ALTER TABLE, start a transaction and defer enforcement of foreign key constraints until the COMMIT. That way we could run the DELETE FROM's in any order and they'd still work.
BEGIN; SET CONSTRAINTS ALL DEFERRED; DELETE FROM tableA; DELETE FROM tableB; COMMIT
This approach also sped up that query! It now runs in about 10ms, versus 20ms before.
Not out of the woods, again
We thought we'd nipped it in the bud, and certainly we were happy with the
performance improvement. Unfortunately we've still been seeing timeouts; less
frequently, but maybe once or twice a week. They usually happen when INSERTing
records into an empty table; Postgres just hangs for 15 seconds, until the
statement_timeout kicks in and Postgres kills the query, which fails the
build. There are no messages about locks, so we're not sure what's going on.
This is pretty uncharacteristic behavior for Postgres, and we're not doing
anything particularly crazy with queries. We're open to suggestions; I posted
about it again on the Postgres mailing list but did not get any good
The amount of work that's gone into tracking this failure down is depressing,
failures. It would be nice if our CI provider had a way to auto-enable SSH on
failed builds, enabled Postgres logging by default, or disabled autovacuum by
default. I'd also recommend if you're adding foreign key constraints to make
DEFERRABLE INITIALLY IMMEDIATE; this gives you the most flexibility to
pursue strategies like we did.
Liked what you read? I am available for hire.