Do you have objects in your system that can be in different states (accounts, invoices, messages, employees)? Do you have code that updates these objects from one state to another? If so, you probably want a state machine.
What is a state machine?
At its root, a state machine defines the legal transitions between states in your system, is responsible for transitioning objects between states, and prevents illegal transitions.
You sound like an architecture astronaut, why do I need this?
Let's talk about some bad things that can happen if you don't have a state machine in place.
(Some of these actually happened! Some are invented.)
A user submits a pickup. We pick up the item and ship it out. Two weeks later, a defect causes the app to resubmit the same pickup, and reassign a driver, for an item that's already been shipped.
Two users submit a pickup within a second of each other. Our routing algorithm fetches available drivers, computes each driver's distance to the pickup, and says the same driver is available for both pickups. We assign the same driver to both pickups.
A user submits a pickup. A defect in a proxy causes the submit request to be sent multiple times. We end up assigning four drivers to the pickup, and sending the user four text messages that their pickup's been assigned.
An item is misplaced at the warehouse and sent straight to the packing station. Crucial steps in the shipping flow get skipped.
Code for updating the state of an object is littered between several different classes and controllers, which handle the object in different parts of its lifecycle. It becomes difficult to figure out how the object moves between various states. Customer support tells you that an item is in a particular state that makes it impossible for them to do their jobs. It takes great effort to figure out how it got there.
These are all really bad positions to be in! A lot of pain for you and a lot of pain for your teams in the field.
You are already managing state (poorly)
Do you have code in your system that looks like this?
    def submit(pickup_id):
        pickup = Pickups.find_by_id(pickup_id)
        # Check the current state before moving to the next one.
        if pickup.state != 'DRAFT':
            raise StateError("Can't submit a pickup that's already been submitted")
        pickup.state = 'SUBMITTED'
        pickup.save()
        MessageService.send_message(pickup.user.phone_number, 'Your driver is on the way!')
By checking the state of the pickup before moving to the next state, you're managing the state of your system! You are (at least partially) defining what transitions are allowed between states, and what transitions aren't. To avoid the issues listed above, you'll want to consolidate the state management in one place in your codebase.
Okay, how should I implement the state machine?
There are a lot of libraries that promise to manage this for you. You don't need any of them (too much complexity), and you don't need a DSL. You just need a dictionary and a single database query.
The dictionary is for defining transitions, allowable input states, and the output state. Here's a simplified version of the state machine we use for Pickups.
    states = {
        'submit': {
            'before': ['DRAFT'],
            'after': 'SUBMITTED',
        },
        'assign': {
            'before': ['SUBMITTED'],
            'after': 'ASSIGNED',
        },
        'cancel': {
            'before': ['DRAFT', 'SUBMITTED', 'ASSIGNED'],
            'after': 'CANCELED',
        },
        'collect': {
            'before': ['ASSIGNED'],
            'after': 'COLLECTED',
        },
    }
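Because the transitions live in a plain dictionary, you can also query them directly. Here is a minimal sketch, assuming the pickup states shown above; `legal_transitions` and `terminal_states` are hypothetical helper names, not part of our actual code.

```python
# Transition table: name -> allowed input states and the output state.
STATES = {
    "submit": {"before": ["DRAFT"], "after": "SUBMITTED"},
    "assign": {"before": ["SUBMITTED"], "after": "ASSIGNED"},
    "cancel": {"before": ["DRAFT", "SUBMITTED", "ASSIGNED"], "after": "CANCELED"},
    "collect": {"before": ["ASSIGNED"], "after": "COLLECTED"},
}


def legal_transitions(state):
    """Return the sorted names of transitions allowed from `state` (hypothetical helper)."""
    return sorted(name for name, rule in STATES.items() if state in rule["before"])


def terminal_states():
    """Return states that some transition produces but no transition accepts as input."""
    outputs = {rule["after"] for rule in STATES.values()}
    inputs = {state for rule in STATES.values() for state in rule["before"]}
    return sorted(outputs - inputs)
```

A helper like this is handy for showing customer support which actions are available on an item, without duplicating the rules anywhere else.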
Then you need a single function, transition, that takes an object ID, the name of a transition, and (optionally) additional fields you'd like to set on the object if you update its state.
The transition function looks up the transition in the states dictionary, and generates a single SQL query:
    UPDATE table
    SET state = 'newstate', extraField1 = 'extraValue1'
    WHERE id = $1 AND state IN ('oldstate1', 'oldstate2')
    RETURNING *
If your UPDATE query returns a row, you successfully transitioned the item! Return the item from the function. If it returns zero rows, you failed to transition the item. It can be tricky to determine why this happened, since you don't know which (invalid) state the item was in that caused it to not match. We cheat and fetch the record to give a better error message - there's a race there, but we note the race in the error message, and it gives us a good idea of what happened in ~98% of cases.
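Here is a rough sketch of what such a transition function might look like, using Python's built-in sqlite3 module so it runs anywhere. Several things here are assumptions for illustration: the `pickups` table with `id`, `state` and `driver_id` columns, the `EXTRA_COLUMNS` allowlist, and the wording of the error message. SQLite only gained RETURNING in version 3.35, so this sketch checks the affected row count and re-reads the row instead; on PostgreSQL you would use RETURNING * as in the query above.

```python
import sqlite3

STATES = {
    "submit": {"before": ["DRAFT"], "after": "SUBMITTED"},
    "assign": {"before": ["SUBMITTED"], "after": "ASSIGNED"},
    "cancel": {"before": ["DRAFT", "SUBMITTED", "ASSIGNED"], "after": "CANCELED"},
    "collect": {"before": ["ASSIGNED"], "after": "COLLECTED"},
}

# Assumed allowlist of extra columns a caller may set. Column names are
# interpolated into the SQL text, so they must never come from user input.
EXTRA_COLUMNS = {"driver_id"}


class StateError(Exception):
    """Raised when a transition is not legal from the item's current state."""


def transition(conn, pickup_id, name, **extra):
    """Attempt the named transition on one pickup; return the row or raise StateError."""
    rule = STATES[name]
    unknown = set(extra) - EXTRA_COLUMNS
    if unknown:
        raise ValueError(f"unsupported columns: {sorted(unknown)}")

    assignments = ["state = ?"] + [f"{column} = ?" for column in extra]
    placeholders = ", ".join("?" for _ in rule["before"])
    sql = (
        f"UPDATE pickups SET {', '.join(assignments)} "
        f"WHERE id = ? AND state IN ({placeholders})"
    )
    params = [rule["after"], *extra.values(), pickup_id, *rule["before"]]

    with conn:  # commits on success, rolls back if the statement raises
        updated = conn.execute(sql, params).rowcount

    if updated == 1:
        # Re-read for the caller; RETURNING * would make this step unnecessary.
        return conn.execute(
            "SELECT id, state, driver_id FROM pickups WHERE id = ?", (pickup_id,)
        ).fetchone()

    # Zero rows matched. Fetch the record to build a better error message.
    # This read is racy: the state may have changed since the UPDATE ran.
    row = conn.execute(
        "SELECT state FROM pickups WHERE id = ?", (pickup_id,)
    ).fetchone()
    if row is None:
        raise StateError(f"pickup {pickup_id} does not exist")
    raise StateError(
        f"can't {name} pickup {pickup_id}: state was {row[0]!r} when re-read "
        f"(it may have changed since the UPDATE ran); allowed: {rule['before']}"
    )
```

Calling `transition(conn, 1, "assign", driver_id=7)` on a SUBMITTED pickup sets both the state and the driver in one statement; calling it again raises StateError, because the row no longer matches the WHERE clause.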
Note what you don't want to do - you don't want to update the object in memory and then call .save() on it. Not only is .save() dangerous, but fetching the item before you attempt to UPDATE it means you'll be vulnerable to race conditions between two threads attempting to transition the same item to two different states (or, twice to the same state). Tell, don't ask - just try the transition and then handle success or failure appropriately.
Say you send a text message to a user after they submit their pickup - if two threads can successfully call the submit transition, the user will get two text messages. The UPDATE query above ensures that exactly one thread will succeed at transitioning the item, which means you can (and want to) pile on whatever only-once actions you like (sending messages, charging customers, assigning drivers, etc.) after a successful transition and ensure they'll run once.
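To make the only-once behavior concrete, here is a small sketch that replays the duplicated-submit scenario from earlier: the same request arrives four times, but only the call whose UPDATE matched a row sends the text message. The `pickups` schema, the `send_message` stand-in and the `demo` function are all assumptions for illustration, and the duplicates are delivered one after another rather than from real concurrent threads.

```python
import sqlite3

sent_messages = []  # stand-in for a real messaging service


def send_message(phone_number, text):
    """Hypothetical stand-in for MessageService.send_message; records the message."""
    sent_messages.append((phone_number, text))


def submit(conn, pickup_id):
    """Try the DRAFT -> SUBMITTED transition; notify the user only if this call won."""
    with conn:  # one statement, committed as its own transaction
        cursor = conn.execute(
            "UPDATE pickups SET state = 'SUBMITTED' "
            "WHERE id = ? AND state IN ('DRAFT')",
            (pickup_id,),
        )
        won = cursor.rowcount == 1
    if not won:
        # Another request already submitted it, or the state was illegal.
        return False
    phone = conn.execute(
        "SELECT phone_number FROM pickups WHERE id = ?", (pickup_id,)
    ).fetchone()[0]
    send_message(phone, "Your driver is on the way!")
    return True


def demo():
    """Deliver the same submit request four times, as the buggy proxy did."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pickups (id INTEGER PRIMARY KEY, state TEXT, phone_number TEXT)"
    )
    conn.execute("INSERT INTO pickups VALUES (1, 'DRAFT', '+15550100')")
    conn.commit()
    return [submit(conn, 1) for _ in range(4)]
```

The side effect hangs off the result of the UPDATE, not off a state check made beforehand, which is what keeps the duplicates harmless.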
For more about consistency, see Weird Tricks to Write Faster, More Correct
Database Queries.
Being able to issue queries like this is one of the benefits of using a relational database with strong consistency guarantees. Your mileage (and the consistency of your data) may vary when attempting to implement a state transition like this using a new NoSQL database. Note that with the latest version of MongoDB, it's possible to read stale data, meaning that (as far as I can tell) the WHERE clause might read out-of-date data, and you can apply an inconsistent state transition.
Final Warnings
A state machine puts you in a much better position with respect to the consistency of your data, and makes it easy to guarantee that actions performed after a state transition (invoicing, sending messages, expensive operations) will be performed exactly once for each legal transition, and will be rejected for illegal transitions. I can't stress enough how often this has saved our bacon.
You'll still need to be wary of code that makes decisions based on other properties of the object. For example, you might set a driver_id on the pickup when you assign it. If other code (or clients) decide to make a decision based on the presence or absence of the driver_id field, you're making a decision based on the state of the object, but outside of the state machine framework, and you're vulnerable to all of the bullet points mentioned above. You'll need to be vigilant about these, and ensure all calling code is making decisions based on the state property of the object, not any auxiliary properties.
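A tiny illustration of how the two checks can disagree. This sketch assumes that canceling a pickup leaves the old driver_id in place, which may or may not be true in your system; the `Pickup` dataclass and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pickup:
    id: int
    state: str
    driver_id: Optional[int] = None


def is_assigned_fragile(pickup):
    """Anti-pattern: infers the state from an auxiliary field."""
    return pickup.driver_id is not None


def is_assigned(pickup):
    """Preferred: consults only the state property."""
    return pickup.state == "ASSIGNED"


# Assumption: this pickup was assigned to driver 7 and then canceled,
# and the cancel transition did not clear driver_id.
canceled = Pickup(id=1, state="CANCELED", driver_id=7)
```

Here `is_assigned_fragile(canceled)` returns True while `is_assigned(canceled)` returns False; only the second one agrees with the state machine.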
You'll also need to be wary of code that tries a read/check-state/write pattern; it's vulnerable to the races mentioned above. Always always just try the state transition and handle failure if it fails.
Finally, some people might try to sneak in code that just updates the object state, outside of the state machine. Be wary of this in code reviews and try to force all state updates to happen in the StateMachine class.
What transaction isolation level do you suggest for that update statement?
Hey! You don’t need an isolation level since it’s a single statement, there’s no chance to read or write stale data!
I had a situation where 2 processes could call the same update against Oracle. The second one was blocked by the first transaction, but executed after the first commit, resulting in duplicate downstream messages being generated. The ‘SKIP LOCKED’ clause fixed it, but the moral of the story is to test concurrent update attempts against your back end data store.
Thanks for this post! It’s very timely for me – I’m at a point where I realize I need state machines, and seeing your simple query-centric implementation helps glue things together in my mind.