Everything you need to know about state machines

Do you have objects in your system that can be in different states (accounts, invoices, messages, employees)? Do you have code that updates these objects from one state to another? If so, you probably want a state machine.

What is a state machine?

[Image: To Everything, There is a Season, by Shawn Clover. CC BY-NC 2.0]

At its root, a state machine defines the legal transitions between states in your system, is responsible for transitioning objects between states, and prevents illegal transitions.

You sound like an architecture astronaut, why do I need this?

Let's talk about some bad things that can happen if you don't have a state machine in place.

(Some of these actually happened! Some are invented.)

  • A user submits a pickup. We pick up the item and ship it out. Two weeks later, a defect causes the app to resubmit the same pickup, and reassign a driver, for an item that's already been shipped.

  • Two users submit a pickup within a second of each other. Our routing algorithm fetches available drivers, computes each driver's distance to the pickup, and says the same driver is available for both pickups. We assign the same driver to both pickups.

  • A user submits a pickup. A defect in a proxy causes the submit request to be sent multiple times. We end up assigning four drivers to the pickup, and sending the user four text messages that their pickup's been assigned.

  • An item is misplaced at the warehouse and sent straight to the packing station. Crucial steps in the shipping flow get skipped.

  • Code for updating the state of an object is littered between several different classes and controllers, which handle the object in different parts of its lifecycle. It becomes difficult to figure out how the object moves between various states. Customer support tells you that an item is in a particular state that makes it impossible for them to do their jobs. It takes great effort to figure out how it got there.

These are all really bad positions to be in! A lot of pain for you and a lot of pain for your teams in the field.

You are already managing state (poorly)

Do you have code in your system that looks like this?

def submit(pickup_id):
    pickup = Pickups.find_by_id(pickup_id)
    if pickup.state != 'DRAFT':
        raise StateError("Can't submit a pickup that's already been submitted")
    pickup.state = 'SUBMITTED'
    pickup.save()
    MessageService.send_message(pickup.user.phone_number, 'Your driver is on the way!')

By checking the state of the pickup before moving to the next state, you're managing the state of your system! You are (at least partially) defining what transitions are allowed between states, and what transitions aren't. To avoid the issues listed above, you'll want to consolidate the state management in one place in your codebase.

Okay, how should I implement the state machine?

There are a lot of libraries that promise to manage this for you. You don't need any of them (too much complexity), and you don't need a DSL. You just need a dictionary and a single database query.

The dictionary is for defining transitions, allowable input states, and the output state. Here's a simplified version of the state machine we use for Pickups.

states = {
    'submit': {
        'before': ['DRAFT'],
        'after':  'SUBMITTED',
    },
    'assign': {
        'before': ['SUBMITTED'],
        'after':  'ASSIGNED',
    },
    'cancel': {
        'before': ['DRAFT', 'SUBMITTED', 'ASSIGNED'],
        'after':  'CANCELED',
    },
    'collect': {
        'before': ['ASSIGNED'],
        'after':  'COLLECTED',
    },
}

Then you need a single function, transition, that takes an object ID, the name of a transition, and (optionally) additional fields you'd like to set on the object if you update its state.

The transition function looks up the transition in the states dictionary, and generates a single SQL query:

UPDATE table SET
    state = 'newstate',
    extraField1 = 'extraValue1'
WHERE
    id = $1 AND
    state IN ('oldstate1', 'oldstate2')
RETURNING *

If your UPDATE query returns a row, you successfully transitioned the item! Return the item from the function. If it returns zero rows, you failed to transition the item. It can be tricky to determine why this happened, since you don't know which (invalid) state the item was in that caused it to not match. We cheat and fetch the record to give a better error message - there's a race there, but we note the race in the error message, and it gives us a good idea of what happened in ~98% of cases.
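
Here's a minimal sketch of what that transition function might look like, assuming a psycopg2-style connection and the pickups table and states dictionary from above; TransitionError and the connection handling are made up for illustration.

class TransitionError(Exception):
    pass

def transition(conn, pickup_id, name, extra_fields=None):
    """Attempt the named transition; return the updated row, or raise TransitionError."""
    spec = states[name]
    extra_fields = extra_fields or {}
    # Column names come from our own code (the states dict and trusted callers), never from user input.
    columns = ['state'] + list(extra_fields.keys())
    assignments = ', '.join('{} = %s'.format(col) for col in columns)
    in_clause = ', '.join(['%s'] * len(spec['before']))
    query = ('UPDATE pickups SET ' + assignments +
             ' WHERE id = %s AND state IN (' + in_clause + ') RETURNING *')
    params = [spec['after']] + list(extra_fields.values()) + [pickup_id] + spec['before']
    with conn.cursor() as cur:
        cur.execute(query, params)
        row = cur.fetchone()  # None if no row matched, i.e. the transition was illegal
    conn.commit()
    if row is None:
        raise TransitionError("Could not {} pickup {}".format(name, pickup_id))
    return row

A caller might then run transition(conn, pickup_id, 'assign', {'driver_id': driver_id}) and only fire the "driver assigned" text message if the call returns instead of raising.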

Note what you don't want to do - you don't want to update the object in memory and then call .save() on it. Not only is .save() dangerous, but fetching the item before you attempt to UPDATE it leaves you vulnerable to race conditions between two threads attempting to transition the same item to two different states (or to the same state twice). Don't ask permission, ask forgiveness - just try the transition and then handle success or failure appropriately.

Say you send a text message to a user after they submit their pickup - if two threads can successfully call the submit transition, the user will get 2 text messages. The UPDATE query above ensures that exactly one thread will succeed at transitioning the item, which means you can (and want to) pile on whatever only-once actions you like (sending messages, charging customers, assigning drivers, &c) after a successful transition and ensure they'll run once. For more about consistency, see Weird Tricks to Write Faster, More Correct Database Queries.

Being able to issue queries like this is one of the benefits of using a relational database with strong consistency guarantees. Your mileage (and the consistency of your data) may vary when attempting to implement a state transition like this using a new NoSQL database. Note that with the latest version of MongoDB, it's possible to read stale data, meaning that (as far as I can tell) the WHERE clause might read out-of-date data, and you can apply an inconsistent state transition.

Final Warnings

A state machine puts you in a much better position with respect to the consistency of your data, and makes it easy to guarantee that actions performed after a state transition (invoicing, sending messages, expensive operations) will be performed exactly once for each legal transition, and will be rejected for illegal transitions. I can't stress enough how often this has saved our bacon.

You'll still need to be wary of code that makes decisions based on other properties of the object. For example, you might set a driver_id on the pickup when you assign it. If other code (or clients) decide to make a decision based on the presence or absence of the driver_id field, you're making a decision based on the state of the object, but outside of the state machine framework, and you're vulnerable to all of the bullet points mentioned above. You'll need to be vigilant about these, and ensure all calling code is making decisions based on the state property of the object, not any auxiliary properties.
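
For example (a sketch; notify_driver is a made-up helper), this is the kind of check to avoid, and what to do instead:

# Don't do this - it infers the pickup's state from an auxiliary field:
if pickup.driver_id is not None:
    notify_driver(pickup)

# Do this - make the decision on the state property the state machine manages:
if pickup.state == 'ASSIGNED':
    notify_driver(pickup)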

You'll also need to be wary of code that tries a read/check-state/write pattern; it's vulnerable to the races mentioned above. Always always just try the state transition and handle failure if it fails.

Finally, some people might try to sneak in code that just updates the object state, outside of the state machine. Be wary of this in code reviews and try to force all state updates to happen in the StateMachine class.


Weird Tricks to Write Faster, More Correct Database Queries

Adrian Colyer wrote a great summary of a recent paper by Peter Bailis et al. In the paper the database researchers examine open source Rails applications and observe that the applications apply constraints - foreign key references, uniqueness constraints - in a way that's not very performant or correct.

I was pretty surprised to read about this! For the most part we have avoided problems like this at Shyp, and I didn't realize how widespread this problem is; I certainly have written a lot of bad queries in the past.

So! Let's learn some tips for writing better queries. Everything below will help you write an application that is more correct - it will avoid consistency problems in your data - and more performant - you should be able to achieve the same results as Rails, with fewer queries!

ps - The info below may be really obvious to you! Great! There are a lot of people who aren't familiar with these techniques, as the paper above indicates.

Use database constraints to enforce business logic

Say you define an ActiveRecord class that looks like this:

class User < ActiveRecord::Base
  validates :email, uniqueness: true
end

What actually happens when you try to create a new user? It turns out Rails will make 4 (four!) roundtrips to the database.

  1. BEGIN a transaction.

  2. Perform a SELECT to see if any other users have that email address.

  3. If the SELECT turns up zero rows, perform an INSERT to add the row.

  4. Finally, COMMIT the result.

This is pretty slow! It also increases the load on your application and your database, since you need to make 4 requests for every INSERT. Bailis et al also show that with your database's default transaction isolation level, it's possible to insert two records with the same key. Furthermore, there are some ActiveRecord queries which skip the built-in validations, as Gary Bernhardt discussed in his video, "Where Correctness Is Enforced", way back in 2012. Any query which inserts data and skips the validations can compromise the integrity of your database.

What if I told you you can do the same insert in one query instead of four, and it would be more correct than the Rails version? Instead of Rails's migration, write this:

CREATE TABLE users (email TEXT UNIQUE);

The UNIQUE is the key bit there; it adds a unique key on the table. Then, instead of wrapping the query in a transaction, just try an INSERT.

> insert into users (email) values ('foo@example.com');
INSERT 0 1
> insert into users (email) values ('foo@example.com');
ERROR:  duplicate key value violates unique constraint "users_email_key"
DETAIL:  Key (email)=(foo@example.com) already exists.

You'll probably need to add better error handling around the failure case - at least we did, for the ORM we use. But at any level of query volume, or if speed counts (and it probably does), it's worth it to investigate this.
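
For example, with a Postgres driver like psycopg2, the error handling might look something like this sketch (the pgcode check and the ValueError mapping are illustrative, not a prescription):

import psycopg2
from psycopg2 import errorcodes

def create_user(conn, email):
    try:
        with conn.cursor() as cur:
            cur.execute('INSERT INTO users (email) VALUES (%s) RETURNING email', (email,))
            row = cur.fetchone()
        conn.commit()
        return row
    except psycopg2.IntegrityError as err:
        conn.rollback()
        if err.pgcode == errorcodes.UNIQUE_VIOLATION:
            raise ValueError('That email address is already taken')
        raise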

Just Try the Write

Say you wanted to read a file. You could write this:

if not os.path.isfile(filename):
    raise ValueError("File does not exist")
with open(filename, 'r') as f:
    f.read()
    ...

But that would still be vulnerable to a race! What if the OS or another thread deleted the file between the isfile check and the with open line - the latter would throw an IOError, which won't be handled. Far better to just try to read the file and handle errors appropriately.

try:
    with open(filename, 'r') as f:
        f.read()
        ...
except IOError:
    raise ValueError("File does not exist")

Say you have a foreign key reference - phone_numbers.user_id refers to users.id, and you want to validate that the user_id is valid. You could do:

def write_phone_number(number, user_id):
    user = Users.find_by_id(user_id)
    if user is None:
        raise NotFoundError("User not found")
    Number.create(number=number, user_id=user_id)

Just try to write the number! If you have a foreign key constraint in the database, and the user doesn't exist, the database will tell you so. Otherwise you have a race between the time you fetch the user and the time you create the number.

def write_phone_number(number, user_id):
    try:
        Number.create(number=number, user_id=user_id)
    except DatabaseError as e:
        if is_foreign_key_error(e):
            raise NotFoundError("Don't know that user id")
        raise

Updates Should Compose

Let's say you have the following code to charge a user's account.

def charge_customer(account_id, amount=20):
    account = Accounts.get_by_id(account_id)
    account.balance = account.balance - amount
    if account.balance < 0:
        raise ValueError("Negative account balance")
    else:
        account.save()

Under the hood, here's what that will generate:

SELECT * FROM accounts WHERE id = ?
UPDATE accounts SET balance = 30 WHERE id = ?;

So far, so good. But what happens if two requests come in to charge the account at the same time? Say the account balance is $100.

  1. Thread 1 wants to charge $30. It reads the account balance at $100.

  2. Thread 2 wants to charge $15. It reads the account balance at $100.

  3. Thread 1 subtracts $30 from $100 and gets a new balance of $70.

  4. Thread 2 subtracts $15 from $100 and gets a new balance of $85.

  5. Thread 1 attempts to UPDATE the balance to $70.

  6. Thread 2 attempts to UPDATE the balance to $85.

This is clearly wrong! The balance after $45 of charges should be $55, but it was $70, or $85, depending on which UPDATE went last. There are a few ways to deal with this:

  • Create some kind of locking service to lock the row before the read and release the lock after you write the balance. The other thread will wait for the lock before it reads/writes the balance. Locking services are hard to get right and carry a latency penalty.

  • Run the update in a transaction; this will create an implicit lock on the row. If the transaction runs at the SERIALIZABLE or REPEATABLE READ levels, this is safe. Note most databases will set the default transaction level to READ COMMITTED, which won't protect against the issue referenced above.

  • Skip the SELECT and write a single UPDATE query that looks like this:
    UPDATE accounts SET balance = balance - 20 WHERE id = ?;

That last UPDATE is composable. You can run a million balance updates in any order, and the end balance will be exactly the same, every time. Plus you don't need a transaction or a locking service; it's exactly one write (and faster than the .save() version above!)

But if I do just one UPDATE, I can't check whether the balance will go below zero! You can - you just need to enforce the nonnegative constraint in the database, not the application.

CREATE TABLE accounts (
    id integer primary key,
    balance integer CHECK (balance >= 0)
);

That will throw any time you try to write a negative balance, and you can handle the write failure in the application layer.
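
Here's a sketch of what handling that failure might look like, again assuming psycopg2; the composable UPDATE and the CHECK constraint do all the real work:

import psycopg2
from psycopg2 import errorcodes

def charge_customer(conn, account_id, amount):
    try:
        with conn.cursor() as cur:
            cur.execute(
                'UPDATE accounts SET balance = balance - %s WHERE id = %s RETURNING balance',
                (amount, account_id),
            )
            new_balance = cur.fetchone()
        conn.commit()
        return new_balance
    except psycopg2.IntegrityError as err:
        conn.rollback()
        if err.pgcode == errorcodes.CHECK_VIOLATION:
            raise ValueError('Negative account balance')
        raise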

Update: Apparently MySQL accepts check constraints as valid syntax, but does not execute them, so you might need to take a different approach. Thanks olivier for pointing this out!

The key point is that your updates should be able to run in any order without breaking the application. Use relative ranges - balance = balance - 20 for example - if you can. Or, only apply the UPDATE if the previous state of the database is acceptable, via a WHERE clause. The latter technique is very useful for state machines:

UPDATE pickups SET status='submitted' WHERE status='draft' AND id=?;

That update will either succeed (if the pickup was in draft), or return zero rows. If you have a million threads try that update at the same time, only one will succeed - an incredibly valuable property!

Beware of save()

The save() function in an ORM is really unfortunate for two reasons. First, to call .save(), you have to retrieve an instance of the object via a SELECT. If you already have the object's ID and the fields you want to write, you can skip that read entirely and just issue the UPDATE. The extra read adds latency and opens up the possibility of writing stale data.

Second, some implementations of .save() will issue an UPDATE and update every column.

This can lead to updates getting clobbered. Say two requests come in, one to update a user's phone number, and the other to update a user's email address, and both call .save() on the record.

UPDATE users SET email='oldemail@example.com', phone_number='newnumber' WHERE id = 1;
UPDATE users SET email='newemail@example.com', phone_number='oldnumber' WHERE id = 1;

In this scenario the first UPDATE gets clobbered, and the old email gets persisted. This is really bad! We told the first thread that we updated the email address, and then we overwrote it. Your users and your customer service team will get really mad, and this will be really hard to reproduce. Be wary of .save() - if correctness is important (and it should be!), issue an UPDATE with only the columns you want to change.
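
A sketch of the alternative, assuming a parameterized query through psycopg2: touch only the column that actually changed, so a concurrent phone-number update can't be clobbered.

def update_email(conn, user_id, email):
    # Write only the email column; leave phone_number (and everything else) alone.
    with conn.cursor() as cur:
        cur.execute('UPDATE users SET email = %s WHERE id = %s', (email, user_id))
    conn.commit()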

Partial Indexes

If you thought the previous section was interesting, check this out. Say we have a pickups table. Each pickup has a driver ID and a status (DRAFT, ASSIGNED, QUEUED, etc).

CREATE TABLE pickups (
    id integer,
    driver_id INTEGER REFERENCES drivers(id),
    status TEXT
);

We want to enforce a rule that a given driver can only have one ASSIGNED pickup at a time. You can do this in the application by using transactions and writing very, very careful code... or you can ask Postgres to do it for you:

CREATE UNIQUE INDEX "only_one_assigned_driver" ON pickups(driver_id) WHERE
    status = 'ASSIGNED';

Now watch what happens if you attempt to violate that constraint:

> INSERT INTO pickups (id, driver_id, status) VALUES (1, 101, 'ASSIGNED');
INSERT 0 1
> INSERT INTO pickups (id, driver_id, status) VALUES (2, 101, 'DRAFT');
INSERT 0 1 -- OK, because it's draft; doesn't hit the index.
> INSERT INTO pickups (id, driver_id, status) VALUES (3, 101, 'ASSIGNED');
ERROR:  duplicate key value violates unique constraint "only_one_assigned_driver"
DETAIL:  Key (driver_id)=(101) already exists.

We got a duplicate key error when we tried to insert a second ASSIGNED record! Because you can trust the database to not ever screw this up, you have more flexibility in your application code. Certainly you don't have to be as careful to preserve the correctness of the system, since it's impossible to put it in a bad state!

Summary

In many instances your ORM may be generating a query that's both slow, and can lead to concurrency errors. That's the bad news. The good news is you can write database queries that are both faster and more correct!

A good place to start is by reversing the traditional model of ORM development. Instead of starting with the code in the ORM and working backwards to the query, start with the query you want, and figure out how to express that in your ORM. You'll probably end up using the lower-level methods offered by your ORM a lot more, and you'll probably discover defects in the way that your ORM handles database errors. That's okay! You are on a much happier path.

Thanks to Alan Shreve and Kyle Conroy for reading drafts of this post.


Maybe Automatically Updating Dependencies Isn’t a Great Idea

There's a distressing feeling in the Node.js community that apps without up-to-date dependencies are somehow not as good, or stable, as apps that always keep their dependencies up to date. So we see things like greenkeeper.io and badges that show whether the project's dependencies are up to date (and, implicitly, shame anyone whose dependencies aren't green).

I'm not talking about updating dependencies for a good reason (covered below); I'm talking about the practice of updating dependencies for the sake of keeping the dependencies updated. In the best possible case, a dependency update does nothing. The application keeps running exactly as it was. In the worst case, your servers crash, millions of dollars of business value are affected, state is compromised, or worse.

One day at Twilio we tried to deploy code that had a defect in it. The deployment tool noticed the errors and tried to rollback the deployment by putting the old nodes in the load balancer and taking the new ones out. Except... when it went to take the new nodes out, the worker process crashed. So we ended up with both the new (faulty) nodes and the old nodes in the load balancer, and our most reliable tool for cluster management couldn't pull the bad nodes out.

We did some investigation and it turns out one of our dependencies had updated. Well, it wasn't a direct dependency - we locked down all of those - it was a dependency of a dependency, which upgraded to version 3, and introduced an incompatibility.

Fundamentally, updating dependencies is a dangerous operation. Most people would never deploy changes to production without having them go through code review, but I have observed that many feel comfortable bumping a package.json number without looking at the diff of what changed in the dependency.

New releases of dependencies are usually less tested in the wild than older versions. We know the best predictor of the number of errors in code is the number of lines written. The current version of your dependency (which you know works) has 0 lines of diff when compared with itself, but the newest release has a greater-than-0 number of lines of code changed. Code changes are risky, and so are dependency updates.

Errors and crashes introduced by dependency updates are difficult to debug. Generally, the errors are not in your code; they're deep in a node_modules or site-packages folder or similar. You are probably more familiar with your application code than the intricacies of your third party tools. Tracking down the error usually involves figuring out what version of the code used to be running (not easy!) and staring at a diff between the two.

But my tests will catch any dependency errors, you say. Maybe you have great coverage around your application. But do the dependencies you're pulling in have good test coverage? Are the interactions between your dependencies tested? How about the interactions between the dependency and every possible version of a subdependency? Do your tests cover every external interface?

But the dependencies I'm pulling in use semver, so I'll know if things break. This only saves you if you actually read the CHANGELOG, or if the package author correctly identifies a breaking change. Otherwise you get situations like this, which just seem sad: the reporter must have taken time to update the package, then gotten an error report, then had to figure out what change crashed the servers, then mitigated the issue. That's a lot of downside - wasted time and the business fallout of a crashing application - and I'm struggling to figure out what benefit the reporter got from updating to the latest possible version.

When to Update Dependencies

Generally I think you should lock down the exact versions of every dependency and sub-dependency that you use. However, there are a few cases where it makes sense to pull in the latest and greatest thing. In every case, at the very least I read the CHANGELOG and scan the package diff before pulling in the update.

Security Upgrades

A dependency issues a new release to patch a security vulnerability, and you need to get the latest version to patch the same hole. Even here, you want to ensure that the vulnerability actually affects your application, and that the changed software does not break your application. You may not want to grab the entire upstream changeset, but only port in the patch that fixes the security issue.

Performance Improvement

Recently we noticed our web framework was sleeping for 50ms on every POST and PUT request. Of course you would want to upgrade to avoid this (although we actually fixed it by removing the dependency).

You Need a Hot New Feature

We updated mocha recently because it wouldn't print out stack traces for things that weren't Error objects. We submitted a patch and upgraded mocha to get that feature.

You Need a Bug Fix

Your version of the dependency may have a defect, and upgrading will fix the issue. Ensure that the fix was actually coded correctly.

A Final Warning

Updated dependencies introduce a lot of risk and instability into your project. There are valid reasons to update, and you'll need to weigh the benefit against the risk. But updating dependencies just for the sake of updating them is going to run you into trouble.

You can avoid all of these problems by not adding a dependency in the first place. Think really, really hard before reaching for a package to solve your problem. Actually read the code and figure out if you need all of it or just a subset. Maybe if you need to pluralize your application's model names, it's okay to just add an 's' on the end, instead of adding the pluralize library. Sure, the Volcano model will be Volcanos instead of Volcanoes but maybe that's okay for your system.

Unfortunately my desired solution for adding dependencies - fork a library, rip out the parts you aren't using, and copy the rest directly into your source tree - isn't too popular. But I think it would help a lot with enforcing the idea that you own your dependencies and the code changes inside.


Avoid Priming Interview Candidates

To avoid having interview candidates start every interview with the same "tell me your background" boilerplate, we get every interviewer in the room with the candidate at the beginning of the interview for a few minutes, so everyone can introduce themselves and the candidate can walk through their background.

Recently we had a minority candidate with a unique name come for an onsite interview. One person in the loop, who also has an untraditional name, asked "what's the worst butchering of your name you've ever heard?" Seems like an innocuous enough question, and it's possible the candidate didn't think twice about it.

But it stuck with me because of a paper I saw recently, and because we may have unknowingly implicitly activated a social identity. Cue the research:

Recent studies have documented that performance in a domain is hindered when individuals feel that a sociocultural group to which they belong is negatively stereotyped in that domain. We report that implicit activation of a social identity can facilitate as well as impede performance on a quantitative task. When a particular social identity was made salient at an implicit level, performance was altered in the direction predicted by the stereotype associated with the identity. Common cultural stereotypes hold that Asians have superior quantitative skills compared with other ethnic groups and that women have inferior quantitative skills compared with men. We found that Asian-American women performed better on a mathematics test when their ethnic identity was activated, but worse when their gender identity was activated, compared with a control group who had neither identity activated. Cross-cultural investigation indicated that it was the stereotype, and not the identity per se, that influenced performance.

The full paper is available here.

In the study, the researchers gave female Asian-American undergrads a math test. However, some were primed to think about the fact that they were women (via questions about whether they lived in a single-sex or coed dorm, and why they would prefer a coed or single-sex floor), and some were primed to think about the fact that they were Asian (via questions about whether their parents/grandparents spoke English, and how many generations they had lived in America). The female-primed group answered 43% of questions correctly, the neutral (no priming questions) answered 49% correctly, and the Asian-primed group answered 53% of questions correctly. That's a pretty large difference! The original study was replicated in 2014 and the same effects were found (on people who are aware of the stereotypes).

So back to our interview! It's possible that by asking a well-intentioned icebreaker question about the candidate's name, we were actually priming their racial identity, and leading them to do worse in an interview. We accidentally emphasized the difference between the candidate and ourselves (white people with names like "Kevin"). We talked about it afterwards, and shared around a link to the paper above, in the hopes that we are giving every candidate the same playing field.

This stuff is hard to get right, you have to think about it, and you have to be aware of the research. We're not perfect, but we know these effects exist and we're trying hard to fight them.


Don’t Use Sails (or Waterline)

The Shyp API currently runs on top of the Sails JS framework. It's an extremely popular framework - the project has over 11,000 stars on Github, and it's one of the top 100 most popular projects on the site. However, we've had a very poor experience with it, and with Waterline, the ORM that runs underneath it. Remember when you learned that java.net.URL does a DNS lookup to check whether a URL is equivalent to another URL? Imagine finding an issue like that every two weeks or so and that's the feeling I get using Sails and Waterline.

The project's maintainers are very nice, and considerate - we have our disagreements about the best way to build software, but they're generally responsive. It's also clear from the error handling in the project (at least the getting started docs) that they've thought a lot about the first run experience, and helping people figure out the answers to the problems they encounter trying to run Sails.

That said, here are some of the things we've struggled with:

  • The sailsjs.org website broke all incoming Google links ("sails views", "sails models", etc), as well as about 60,000 of its own autogenerated links, for almost a year. Rachael Shaw has been doing great work to fix them again, but it was pretty frustrating that documentation was so hard to find for that long.

  • POST and PUT requests that upload JSON or form-urlencoded data sleep for 50ms in the request parser. This sleep occupied about 30% of the request time on our servers, and 50-70% of the time in controller tests.

  • The community of people who use Sails doesn't seem to care much about performance or correctness. The above error was present for at least a year and not one person wondered why simple POST requests take 50ms longer than a simple GET. For a lot of the issues above and below, it seems like we are the only people who have run into them, or who care.

  • By default Sails generates a route for every function you define in a controller, whether it's meant to be public or not. This is a huge security risk, because you generally don't think to write policies for these implicitly-created routes, so it's really easy to bypass any authentication rules you have set up and hit a controller directly.

  • Blueprints are Sails's solution for a CRUD app and we've observed a lot of unexpected behavior with them. For one example, passing an unknown column name as the key parameter in a GET request (?foo=bar) will cause the server to return a 500.

  • If you want to test the queries in a single model, there's no way to do it besides loading/lifting the entire application, which is dog slow - on our normal sized application, it takes at least 7 seconds to begin running a single test.

  • Usually when I raise an issue on a project, the response is that there's some newer, better thing being worked on, that I should use instead. I appreciate that, but porting an application has lots of downside and little upside. I also worry about the support and correctness of the existing tools that are currently in wide use.

  • Hardcoded typos in command arguments.

  • No documented responsible disclosure policy, or information on how security vulnerabilities are handled.

Waterline

Waterline is the ORM that powers Sails. The goal of Waterline is to provide the same query interface for any database that you would like to use. Unfortunately, this means that the supported feature set is the least common denominator of every supported database. We use Postgres, and by default this means we can't get a lot of leverage out of it.

These issues are going to be Postgres oriented, because that's the database we use. Some of these have since been fixed, but almost all of them (apart from the data corruption issues) have bit us at one point or another.*

  • No support for transactions. We had to write our own transaction interface completely separate from the ORM.

  • No support for custom Postgres types (bigint, bytea, array). If you set a column to type 'array' in Waterline, it creates a text field in the database and serializes the array by calling JSON.stringify.

  • If you define a column to be type 'integer', Waterline will reject things that don't look like integers... except for Mongo IDs, which look like "4cdfb11e1f3c000000007822". Waterline will pass these through to the database no matter what backend data store you are using.

  • Waterline offers a batch interface for creating items, e.g. Users.create([user1, user2]). Under the hood, however, creating N items issues N insert requests for one record each, instead of one large request. 29 out of 30 times, the results will come back in order, but there used to be a race where sometimes create will return results in a different order than the order you inserted them. This caused a lot of intermittent, hard-to-parse failures in our tests until we figured out what was going on.

  • Waterline queries are case insensitive; that is, Users.find().where(name: 'FOO') will turn into SELECT * FROM users WHERE name = LOWER('FOO');. There's no way to turn this off. If you ask Sails to generate an index for you, it will place the index on the uppercased column name, so your queries will miss it. If you generate the index yourself, you pretty much have to use the lowercased column value & force every other query against the database to use that as well.

  • The .count function used to work by pulling the entire table into memory and checking the length of the resulting array.

  • No way to split out queries to send writes to a primary and reads to a replica. No support for canceling in-flight queries or setting a timeout on them.

  • The test suite is shared by every backend adapter; this makes it impossible for the Waterline team to write tests for database-specific behavior or failure handling (unique indexes, custom types, check constraints, etc). Any behavior specific to your database is poorly tested at best.

  • "Waterline truncated a JOIN table". There are probably more issues in this vein, but we excised all .populate, .associate, and .destroy calls from our codebase soon after this, to reduce the risk of data loss.

  • When Postgres raises a uniqueness or constraint violation, the resulting error handling is very poor. Waterline used to throw an object instead of an Error instance, which means that Mocha would not print anything about the error unless you called console.log(new Error(err)); to turn it into an Error. (It's since been fixed in Waterline, and I submitted a patch to Mocha to fix this behavior, but we stared at empty stack traces for at least six months before that). Waterline attempts to use regex matching to determine whether the error returned by Postgres is a uniqueness constraint violation, but the regex fails to match other types of constraint failures like NOT NULL errors or partial unique indexes.

  • The error messages returned by validation failures are only appropriate to display if the UI can handle newlines and bullet points. Parsing the error message to display any other scenario is very hard; we try really hard to dig the underlying pg error object out and use that instead. Mostly nowadays we've been creating new database access interfaces that wrap the Waterline model instances and handle errors appropriately.

Conclusion

I appreciate the hard work put in by the Sails/Waterline team and contributors, and it seems like they're really interested in fixing a lot of the issues above. I think it's just really hard to be an expert in sixteen different database technologies, and write a good library that works with all of them, especially when you're not using a given database day in and day out.

You can build an app that's reliable and performant on top of Sails and Waterline - we think ours is, at least. You just have to be really careful, avoid the dangerous parts mentioned above, and verify at every step that the ORM and the router are doing what you think they are doing.

The sad part is that in 2015, you have so many options for building a reliable service, that let you write code securely and reliably and can scale to handle large numbers of open connections with low latency. Using a framework and an ORM doesn't mean you need to enter a world of pain. You don't need to constantly battle your framework, or worry whether your ORM is going to delete your data, or it's generating the correct query based on your code. Better options are out there! Here are some of the more reliable options I know about.

  • Instagram used Django well through its $1 billion acquisition. Amazing community and documentation, and the project is incredibly stable.

  • You can use Dropwizard from either Java or Scala, and I know from experience that it can easily handle hundreds/thousands of open connections with incredibly low latency.

  • Hell, the Go standard library has a lot of reliable, multi-threaded, low latency tools for doing database reads/writes and server handling. The third party libraries are generally excellent.

I'm not amazingly familiar with backend Javascript - this is the only server framework I've used - but if I had to use Javascript I would check out whatever the Walmart and the Netflix people are using to write Node, since they need to care a lot about performance and correctness.

*: If you are using a database without traditional support for transactions and constraints, like Mongo, correct behavior is going to be very difficult to verify. I wish you the best of luck.


Stepping Up Your Pull Request Game

Okay! You had an idea for how to improve the project, the maintainers indicated they'd approve it, you checked out a new branch, made some changes, and you are ready to submit it for review. Here are some tips for submitting a changeset that's more likely to pass through code review quickly, and make it easier for people to understand your code.

A Very Brief Caveat

If you are new to programming, don't worry about getting these details right! There are a lot of other useful things to learn first, like the details of a programming language, how to test your code, how to retrieve data you need, then parse it and transform it into a useful format. I started programming by copy/pasting text into the WordPress "edit file" UI.

Write a Good Commit Message

A big, big part of your job as an engineer is communicating what you're doing to other people. When you communicate, you can be more sure that you're building the right thing, you increase usage of things that you've built, and you ensure that people don't assign credit to someone else when you build things. Part of this job is writing a clear message for the rest of the team when you change something.

Fortunately you are probably already doing this! If you write a description of your change in the pull request summary field, you're already halfway there. You just need to put that great message in the commit instead of in the pull request.

The first thing you need to do is stop typing git commit -m "Updated the widgets" and start typing just git commit. Git will try to open a text editor; you'll want to configure this to use the editor of your choice.

A lot of words have been written about writing good commit messages; Tim Pope wrote my favorite post about it.

How big should a commit be? Bigger than you think; I rarely use more than one commit in a pull request, though I try to limit pull requests to 400 lines removed or added, and sometimes break a multiple-commit change into multiple pull requests.

Logistics

If you use Vim to write commit messages, the editor will show you if the summary goes beyond 50 characters.

If you write a great commit message, and the pull request is one commit, Github will display it straight in the UI! When you open that commit as a pull request, Github autofills the subject and body of the pull request from the commit message. Note that if your summary is too long, Github will truncate it.

Review Your Pull Request Before Submitting It

Before you hit "Submit", be sure to look over your diff so you don't submit an obvious error. In this pass you should be looking for things like typos, syntax errors, and debugging print statements which are left in the diff. One last read through can be really useful.

Everyone struggles to get code reviews done, and it can be frustrating for reviewers to find things like print statements in a diff. They might be more hesitant to review your code in the future if it has obvious errors in it.

Make Sure The Tests Pass

If the project has tests, try to verify they pass before you submit your change. It's annoying that Github doesn't make the test pass/failure state more obvious before you submit.

Hopefully the project has a test convention - a make test command in a Makefile, instructions in a CONTRIBUTING.md, or automatic builds via Travis CI. If you can't get the tests set up, or it seems like there's an unrelated failure, add a separate issue explaining why you couldn't get the tests running.

If the project has clear guidelines on how to run the tests, and they fail on your change, it can be a sign you weren't paying close enough attention.

The Code Review Process

I don't have great advice here, besides a) patience is a virtue, and b) the faster you are to respond to feedback, the easier it will go for reviewers. Really you should just read Glen Sanford's excellent post on code review.

Ready to Merge

Okay! Someone gave you a LGTM and it's time to merge. As a part of the code review process, you may have ended up with a bunch of commits with messages like "fix tests" or "typo".

These commits are detritus, the prototypes of the creative process, and they shouldn't be part of the permanent record. Why fix them? Because six months from now, when you're staring at a piece of confusing code and trying to figure out why it's written the way it is, you really want to see a commit message that explains the change, not one that says "fix tests".

There are two ways to get rid of these:

Git Amend

If you just made a commit and have a new change that should be part of the same commit, use git commit --amend to add your changes to the current commit. If you don't need to change the message, use git commit --amend --no-edit (I map this to git alter in my git config).

Git Rebase

You want to squash all of those typo fix commits into one. Steve Klabnik has a good guide for how to do this. I use this script, saved as rb:

    #!/usr/bin/env bash
    # rb: rebase the current branch on top of the given branch (default master),
    # then force push the rebased branch to origin.
    branch="$1"
    if [[ -z "$branch" ]]; then
        branch='master'
    fi
    BRANCHREF="$(git symbolic-ref HEAD 2>/dev/null)"
    BRANCHNAME=${BRANCHREF##refs/heads/}
    if [[ "$BRANCHNAME" == "$branch" ]]; then
        echo "Switch to a branch first"
        exit 1
    fi
    git checkout "$branch"
    git pull origin "$branch"
    git checkout "$BRANCHNAME"
    if [[ -n "$2" ]]; then
        git rebase "$branch" "$2"
    else
        git rebase "$branch"
    fi
    git push origin --force "$BRANCHNAME"

If you run that with rb master, it will pull down the latest master from origin, rebase your branch against it, and force push to your branch on origin. Run rb master -i and select "squash" to squash your commits down to one.

As a side effect of rebasing, you'll resolve any merge conflicts that have developed between the time you opened the pull request and the merge time! This can take the headache away from the person doing the merge, and help prevent mistakes.


Logging Database Queries in CircleCI

Our test suite has been failing maybe 2% of the time with a pretty opaque error message. I thought the database logs would have more information about the failure so I figured out how to turn them on with Circle.

Add the following to the database section of your circle.yml:


database:
  override:
    - sudo sed -i "s/#logging_collector = off/logging_collector = on/" /etc/postgresql/9.4/main/postgresql.conf
    - sudo sed -i "s/#log_statement = 'none'/log_statement = 'all'/" /etc/postgresql/9.4/main/postgresql.conf
    - sudo sed -i "s/#log_duration = off/log_duration = on/" /etc/postgresql/9.4/main/postgresql.conf
    - sudo sed -i "s/#log_disconnections = off/log_disconnections = on/" /etc/postgresql/9.4/main/postgresql.conf
    - sudo mkdir -p /var/lib/postgresql/9.4/main/pg_log
    - sudo chown -R postgres:postgres /var/lib/postgresql/9.4/main/pg_log
    - sudo service postgresql restart

Those settings will:

  • enable various logging settings in the Postgres config
  • create the pg_log directory (Postgres can't write if the directory doesn't exist)
  • change ownership of the directory to the postgres user
  • restart Postgres (needed to enable logging).

Then, to trigger the error, you're going to want to run the tests a bunch of times.


test:
  override:
    - while true; do your_test_command | tee -a test.log ; if [[ ${PIPESTATUS[0]} -ne 0 ]]; then break; fi; done; exit 1

Note you need to use the ${PIPESTATUS} array instead of $?, because you're piping the output of your test runner to the test log; you want to do that because Circle can't render text beyond a certain threshold.

Finally, enable SSH on your build so you can view the test log file you've been writing, and your database logs, which will be in /var/lib/postgresql/9.4/main/pg_log.

Happy hunting!


Node’s `require` is dog slow

Our test environment takes 6-9 seconds to load before any tests get run. I tire of this during the ~30 times I run the test suite a day,[1] so I wanted to make it faster.

For better or worse, the API runs on Sails.js. Before running model/controller tests, a bootstrap file in our tests calls sails.lift.

require('sails').lift(function(err) {
    // Run the tests
});

This lift call generates about 400 queries against Postgres to retrieve the database schema, each of which looks like this:

SELECT x.nspname || '.' || x.relname as "Table", x.attnum
as "#", x.attname as "Column", x."Type", case x.attnotnull
when true then 'NOT NULL' else '' end as "NULL",
r.conname as "Constraint", r.contype as "C", r.consrc,
fn.nspname || '.' || f.relname as "F Key", d.adsrc as "Default"
FROM (SELECT c.oid, a.attrelid, a.attnum, n.nspname, c.relname,
a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) as "Type",
a.attnotnull FROM pg_catalog.pg_attribute a, pg_namespace n,
pg_class c WHERE a.attnum > 0 AND NOT a.attisdropped
AND a.attrelid = c.oid and c.relkind not in ('S','v')
and c.relnamespace = n.oid and n.nspname
not in ('pg_catalog','pg_toast','information_schema')) x
left join pg_attrdef d on d.adrelid = x.attrelid
and d.adnum = x.attnum left join pg_constraint r on r.conrelid = x.oid
and r.conkey[1] = x.attnum left join pg_class f on r.confrelid = f.oid
left join pg_namespace fn on f.relnamespace = fn.oid
where x.relname = 'credits' order by 1,2;

I'm not really sure what those queries do, since Sails seems to ignore the schema that's already in the database when generating table queries. Since we use the smallest, safest subset of the ORM we can find, I tried commenting out the sails-postgresql module code that makes those queries to see if the tests would still pass, and the tests did pass... but the load time was still slow.

The next step was to instrument the code to figure out what was taking so long. I wanted to have the Sails loader log the duration of each part of the load process, but this would have required a global variable, and a whole bunch of calls to console.log. It turns out the Unix utility ts can do this for you, if log lines are generated at the appropriate times. Basically, it's an instant awesome tool for generating insight into a program's runtime, without needing to generate timestamps in the underlying tool.

NAME
       ts - timestamp input

SYNOPSIS
       ts [-r] [-i | -s] [format]

DESCRIPTION
       ts adds a timestamp to the beginning of each line of input.

I set the Sails logging level to verbose and piped the output through ts '[%Y-%m-%d %H:%M:%.S]'. I pretty quickly found a culprit:

[2015-04-19 21:53:45.730679] verbose: request hook loaded successfully.
[2015-04-19 21:53:45.731032] verbose: Loading the app's models and adapters...
[2015-04-19 21:53:45.731095] verbose: Loading app models...
[2015-04-19 21:53:47.928104] verbose: Loading app adapters...
[2015-04-19 21:53:47.929343] verbose: Loading blueprint middleware...

That's a 2 second gap between loading the models and loading the adapters.

I started adding profiling to the code near the "Loading app models..." line. I expected to see that attaching custom functions (findOne, update, etc) to the Sails models was the part that took so long. Instead I found out that a module called include-all accounted for almost all of the 2.5 seconds. That module simply requires every file in the models directory, about 30 files.

Further reading/logging revealed that each require call was being made in turn. I've found it, I thought: just use different CPUs to require them all at the same time and see a speedup. Unfortunately, the require operation is synchronous in Node, and runs on one CPU, so the process can still only perform one require at a time.

I tried just loading one model to see how slow that would be. This script took an average of 700 milliseconds to run, on my high end Macbook Pro:

var start = Date.now();
require('api/models/Drivers');
console.log(Date.now() - start);

700 milliseconds to require a model file, and that file's dependencies! I can send a packet to and from New York 8 times in that amount of time. What the hell is it actually doing? For this I turned to good old strace (or dtruss, as it's ported on Macs). First start up a shell to record syscalls for any process that is called node.

# (You'll want to kill any other node processes before running this.)
sudo dtruss -d -n 'node' > /tmp/require.log 2>&1

Open up another shell session and run your little script that calls require and prints the startup time and then exits. You should have a few thousand lines in a file called /tmp/require.log. Here's what I found near the start:

   1186335 stat64("/Users/burke/code/api/api/models/node_modules/async\0", 0x7FFF5FBFE608, 0x9)             = -1 Err#2
   1186382 stat64("/Users/burke/code/api/api/models/node_modules/async.js\0", 0x7FFF5FBFE5B8, 0x9)          = -1 Err#2
   1186405 stat64("/Users/burke/code/api/api/models/node_modules/async.json\0", 0x7FFF5FBFE5B8, 0x9)                = -1 Err#2
   1186423 stat64("/Users/burke/code/api/api/models/node_modules/async.node\0", 0x7FFF5FBFE5B8, 0x9)                = -1 Err#2
   1186438 stat64("/Users/burke/code/api/api/models/node_modules/async.coffee\0", 0x7FFF5FBFE5B8, 0x9)              = -1 Err#2
   1186473 open("/Users/burke/code/api/api/models/node_modules/async/package.json\0", 0x0, 0x1B6)           = -1 Err#2
   1186501 stat64("/Users/burke/code/api/api/models/node_modules/async/index.js\0", 0x7FFF5FBFE5B8, 0x1B6)          = -1 Err#2
   1186519 stat64("/Users/burke/code/api/api/models/node_modules/async/index.json\0", 0x7FFF5FBFE5B8, 0x1B6)                = -1 Err#2
   1186534 stat64("/Users/burke/code/api/api/models/node_modules/async/index.node\0", 0x7FFF5FBFE5B8, 0x1B6)                = -1 Err#2
   1186554 stat64("/Users/burke/code/api/api/models/node_modules/async/index.coffee\0", 0x7FFF5FBFE5B8, 0x1B6)              = -1 Err#2
   1186580 stat64("/Users/burke/code/api/api/node_modules/async\0", 0x7FFF5FBFE608, 0x1B6)          = -1 Err#2
   1186598 stat64("/Users/burke/code/api/api/node_modules/async.js\0", 0x7FFF5FBFE5B8, 0x1B6)               = -1 Err#2
   1186614 stat64("/Users/burke/code/api/api/node_modules/async.json\0", 0x7FFF5FBFE5B8, 0x1B6)             = -1 Err#2
   1186630 stat64("/Users/burke/code/api/api/node_modules/async.node\0", 0x7FFF5FBFE5B8, 0x1B6)             = -1 Err#2
   1186645 stat64("/Users/burke/code/api/api/node_modules/async.coffee\0", 0x7FFF5FBFE5B8, 0x1B6)           = -1 Err#2
   1186670 open("/Users/burke/code/api/api/node_modules/async/package.json\0", 0x0, 0x1B6)          = -1 Err#2
   1186694 stat64("/Users/burke/code/api/api/node_modules/async/index.js\0", 0x7FFF5FBFE5B8, 0x1B6)                 = -1 Err#2
   1186712 stat64("/Users/burke/code/api/api/node_modules/async/index.json\0", 0x7FFF5FBFE5B8, 0x1B6)               = -1 Err#2
   1186727 stat64("/Users/burke/code/api/api/node_modules/async/index.node\0", 0x7FFF5FBFE5B8, 0x1B6)               = -1 Err#2
   1186742 stat64("/Users/burke/code/api/api/node_modules/async/index.coffee\0", 0x7FFF5FBFE5B8, 0x1B6)             = -1 Err#2
   1186901 stat64("/Users/burke/code/api/node_modules/async\0", 0x7FFF5FBFE608, 0x1B6)              = 0 0
   1186963 stat64("/Users/burke/code/api/node_modules/async.js\0", 0x7FFF5FBFE5B8, 0x1B6)           = -1 Err#2
   1187024 stat64("/Users/burke/code/api/node_modules/async.json\0", 0x7FFF5FBFE5B8, 0x1B6)                 = -1 Err#2
   1187050 stat64("/Users/burke/code/api/node_modules/async.node\0", 0x7FFF5FBFE5B8, 0x1B6)                 = -1 Err#2
   1187074 stat64("/Users/burke/code/api/node_modules/async.coffee\0", 0x7FFF5FBFE5B8, 0x1B6)               = -1 Err#2
      1656 __semwait_signal(0x10F, 0x0, 0x1)                = -1 Err#60
   1187215 open("/Users/burke/code/api/node_modules/async/package.json\0", 0x0, 0x1B6)              = 11 0

That's a lot of wasted open()'s, and that's just to find one dependency.[2] To load the single model, node had to open and read 300 different files,[3] and every require which wasn't a relative dependency did the same find-the-node-module-folder dance. The documentation seems to indicate this is desired behavior, and it doesn't seem like there is any way around this.

Now, failed stat calls are not the entire reason require is slow, but if I am reading the timestamps correctly in the strace log, they are a fair amount of it, and the most obviously wasteful thing. I could rewrite every require to be relative, e.g. require('../../node_modules/async'), but that seems cumbersome and wasteful, when I can define the exact rule I want beforehand: if it's not a relative file path, look in node_modules in the top level of my project.

So that's where we are; require for a single model takes 700 milliseconds, require for all the models takes 2.5 seconds, and there don't seem to be great options for speeding that up. There's some discouraging discussion here from core developers about the possibility of speeding up module import.

You are probably saying "load fewer dependencies", and you are right and I would love to, but that is not an option at this point in time, since "dependencies" basically equals Sails, and while we are trying to move off of Sails, we're stuck on it for the time being. Where I can, I write tests that work with objects in memory and don't touch our models/controllers, but Destroy All Software videos only get you so far.

I will definitely think harder about importing another small module vs. just copying the code/license wholesale into my codebase, and I'll definitely look to add something that can warn about unused imports or variables in the code base.

I'm left concluding that importing modules in Node is dog slow. Yes, 300 imports is a lot, but a 700ms load time seems way too high. Python imports can be slow, but as far as I remember, it's mostly for things that compile C on the fly, and for everything else you can work around it by rearranging sys.path to match the most imports first (impossible here). If there's anything I can do - some kind of compile option, or saving the V8 bytecode or something, I'm open to suggestions.

[1] If you follow along with the somewhat-crazy convention of crashing your Node process on an error and having a supervisor restart it, that means that your server takes 6-9 seconds of downtime as well.

[2] The story gets even worse if you are writing Coffeescript, since require will also look for files matching .litcoffee and .coffee.md at every level. You can hack require.extensions to delete these keys.

[3] For unknown reasons, node occasionally decided not to stat/open some imports that were require'd in the file.


Beware tech companies who tell you they don’t negotiate

Reddit and Duck Duck Go recently announced that they are eliminating salary negotiations, in part to help even the playing field for men and women, since men are more likely to negotiate salaries than women.

"Men negotiate harder than women do and sometimes women get penalized when they do negotiate. So as part of our recruiting process, we don’t negotiate with candidates," Pao said in the interview. "We come up with an offer that we think is fair. If you want more equity, we’ll let you swap a little bit of your cash salary for equity, but we aren’t going to reward people who are better negotiators with more compensation."

I am a little skeptical. It's fine for a company to announce this, but they lose nothing from announcing "we don't negotiate offers". The person you really need to ask is the person who has 5 offers from different tech companies. If they go to Reddit or Duck Duck Go and say, "I'll come work for you for another $5k / $10k with the same equity", are they turned down 100% of the time? If yes, then I'll believe a company with a stated policy that says "we don't reward negotiators".

Announcing "We give fair offers" is a smart move on behalf of a negotiator; it lends better credibility to whatever offer is on the table. It's just not in their best interest to disclose that they negotiate.

In my last round of job interviews, one company made me an offer, told me in strong words that "this offer is non negotiable", then ended up offering me both more salary and more equity.

I would also like to remind everyone that, within the last decade, six of the largest tech companies in the USA conspired to depress wages for engineering employees. I will repeat that last sentence. SIX OF THE LARGEST TECH COMPANIES IN THE USA ACTIVELY COLLUDED TO KEEP ENGINEERING SALARIES DOWN FOR FIVE YEARS. Steve Jobs, a person who many tech CEO's admire and try to emulate, threatened Sergey Brin because Google was trying to recruit Safari engineers.

Yes, Reddit and Duck Duck Go are not those companies, but if half of the National League teams are found colluding, I'm going to have a little skepticism when the Rangers tell me "this is the best offer you're going to get". I almost forgot, that example doesn't even work, because all of the MLB owners recently colluded to depress salaries as well. Given the history, as a tech employee you are wise to take anything tech employers say with a grain of salt.

I applaud Reddit and DDG for taking policy steps to address the gaps between pay for men and women. They may find that starting with a high offer helps them close candidates - great! But I am skeptical that "we don't negotiate" is true, or at least, we're not hearing it from the right party in the negotiation.
