You may have seen this on New Year's Eve:
Another leap second, another slew of outages. Handling time correctly is hard! https://t.co/kJepOfsKkv pic.twitter.com/Fwz2Xtpzkd
— Dan Luu (@danluu) January 1, 2017
I'd heard a little about this problem, but I didn't understand how it broke code, or what to do about it. So here is an explainer.
Background
In 2016, the International Earth Rotation and Reference Systems Service decided to add an extra second to the end of the year, because the Earth's rotation does not line up perfectly with an 86,400-second day.
Normally, when the clock hits 23:59:59 on New Year's Eve it ticks over to 00:00:00, but this year the clock repeated the second 23:59:59. So if you had measurement code that looked like:
    a := time.Now() // Dec 31 2016 23:59:59 and 800ms
    // While this operation is in progress, the leap second starts
    doSomeExpensiveOperation()
    duration := time.Since(a) // Taken at Dec 31 2016 23:59:59 (#2!) and 100ms
    fmt.Println(duration)
We usually assume in our code that time can't go backwards, but the duration there is -700ms! This can cause big problems if you have code that always expects nonnegative time durations. In particular, code that accepts timeouts might panic or immediately error if it gets a negative value. Cloudflare passed a negative duration to rand.Int63n, which panics when its argument is not positive, and that crashed some of their servers.
You might think this is only a problem on leap seconds, which so far have been added at most twice in a year (at the end of June or December). That's not good, but I can handle a problem or two a year. That would be nice, but unfortunately time on your servers can jump around quite a bit. An individual server might get a few seconds ahead of or behind the rest of your servers. When it gets the correct time from the NTP server, the time on the server might jump forward or backward a few seconds. If you run any number of servers you are bound to hit this problem sooner or later.
Ok, how do I deal with it?
Good question! Linux (like other POSIX systems) provides CLOCK_MONOTONIC as an option to clock_gettime to solve this problem. With CLOCK_MONOTONIC, clock_gettime reports the time elapsed since an arbitrary starting point, as seconds plus nanoseconds, and that reading never goes backwards. You can use this to get more reliable (and always nonnegative) deltas than getting the system time twice and subtracting the samples.
You'll have to check whether your programming language uses CLOCK_MONOTONIC for implementing calls to get the system time. Many languages don't use it for the simple "what time is it" call, because it returns time measured from an arbitrary starting point, not the epoch.
So you can't really translate between a CLOCK_MONOTONIC value and any given human date; it's only useful when you take multiple samples and compare them.
In particular, Go releases before 1.9 do not use CLOCK_MONOTONIC for calls to time.Now(). The Go runtime has a function that reads the monotonic clock, but it's not a public part of the standard library. You can access it via the monotime library: replace your calls to time.Now() with monotime.Now(). Note that you will get back a uint64 that only makes sense if you call monotime.Now() again a little while later (on the same machine) and subtract the first value from the second.
    import (
        "fmt"
        "time"

        "github.com/aristanetworks/goarista/monotime"
    )

    func main() {
        a := monotime.Now()
        // someExpensiveOperation stands in for the work being timed.
        someExpensiveOperation()
        b := monotime.Now()
        // b - a will never be negative
        fmt.Println(time.Duration(b - a))
    }
Porting code errata
I ported Logrole (a Twilio log viewer) to use monotonic time. In the past it was really easy to call t := time.Now() and then, later, somewhere else in the code, call diff := time.Since(t) to get a Duration. This let you both use t as a wall clock time, and get an elapsed amount of time since t, at some unspecified point later in the codebase and in time.
With CLOCK_MONOTONIC you have to separate these two use cases; you can get a delta or you can get a Time but you can't get both.
The patch was pretty messy and an indication of the problems you might have in your own codebase.
Another problem to watch out for is a uint64 underflow. If you have code like this:
    if since := time.Since(aDeadline); since > 0 {
        return nil, errors.New("deadline exceeded")
    }
You might port it to something like:
    now := monotime.Now()
    deadline := now + uint64(500*time.Millisecond)
    someExpensiveOperation()
    if monotime.Now() - deadline > 0 {
        return nil, errors.New("deadline exceeded")
    }
The problem here is that you are subtracting uint64 values: if the current monotime.Now() reading is under the deadline, the subtraction underflows and yields a very large uint64 value (which is greater than 0), so the body of the if runs even though the deadline has not passed. Instead you need to avoid the subtraction and use
    if monotime.Now() > deadline {
instead.
Conclusion
It is hard to get this right, but it's good practice to keep wall clock time values separate from time values used for deltas and timeouts. If you don't do this from the start, though, you are going to run into problems. It would be nice if this were better supported in your language's standard library; ideally you'd be forced to decide which use case you need a time value for, and then use it only for that.
Great writeup. Golang definitely needs monotonic time in the standard library, even though there are some portability warts. If you read through the github issue about it, there’s unfortunately some aggressively dismissive comments from golang core developers. I find this doubly obnoxious since nanotime exists in the runtime and is used in a couple places in the std libs. If the standard library API’s identified the need for monotonic time in a couple places, why on earth are core devs being so dismissive when golang users point out they have similar needs?
> But this is only a problem on leap seconds, which occur once a year at most. That’s not good, but I can handle one problem a year.
Leap seconds can be added on June 30th as well, and the first year with leap seconds had one both at that moment as well as Dec 31st.