Making Your Services More Reliable
What does Five Nines mean?
Five Nines = 99.999% availability, so you can fail 0.001% of the time
Chasing Five Nines
- 5.26 min. downtime per year
- 1 request out of 100,000 fails
- You can blow your whole yearly budget in a single incident
This might not be appropriate!
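As a rough check on the numbers above, the downtime budget for a given availability target can be computed directly. This is a minimal sketch; the function name `downtime_budget_minutes` is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical helper applied to the Five Nines target.
five_nines = downtime_budget_minutes(0.99999)
print(f"{five_nines:.2f} minutes per year")  # about 5.26
```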
Microservices:
- Rate Limiting
- Auth
- Fraud/IP address scoring
- Validation
- Database service 1
- Database service 2
- Billing
- Feature flags
Many more opportunities for failure!
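To see why more services mean more opportunities for failure, consider a request that needs every service in the list to succeed. A sketch, under the assumption (not from the slides) that each of the eight services independently succeeds 99.99% of the time; `chain_availability` is a hypothetical helper.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs every service to succeed,
    assuming the services fail independently of each other."""
    return per_service ** n_services


# Eight services in the list above; the 99.99% figure is an assumed example.
overall = chain_availability(0.9999, 8)
print(f"{overall:.4%}")  # roughly 99.92%, well short of five nines
```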
What happens when I type google.com into my browser?
DNS Lookup
man getaddrinfo
Establish TCP Connection
man connect
Write Request
man 2 write
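The three steps above map onto three calls that can each fail on their own. A minimal sketch in Python (an assumed language choice, since the slides only name the man pages); `send_request` is a hypothetical helper, and it deliberately sets no timeouts.

```python
import socket


def send_request(host: str, port: int, payload: bytes) -> None:
    # 1. DNS lookup: wraps getaddrinfo(3), resolving the name to addresses.
    family, socktype, proto, _canon, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    # 2. Establish the TCP connection: wraps connect(2).
    with socket.socket(family, socktype, proto) as sock:
        sock.connect(addr)
        # 3. Write the request: sendall keeps sending until all bytes
        #    have been handed to the kernel.
        sock.sendall(payload)


# Example call (not executed here because it needs network access):
# send_request("google.com", 80, b"GET / HTTP/1.0\r\nHost: google.com\r\n\r\n")
```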
ENom - 5 million domains
CloudFront - 100 Minutes Downtime
DNSimple - 11 Hour Downtime
DNS Server is Down/Unreachable
You Might Be Vulnerable If...
- Mobile apps that use DNS to connect to an API
- Client libraries that use DNS to connect to an API
- Your service uses DNS to connect to third-party APIs (Stripe, Mailchimp...)
DNS Resolver is Down - Workarounds
- Use multiple resolvers
- Internally, connect directly to IP addresses instead of DNS
- Set shorter timeout in /etc/resolv.conf
- If you don't get a response, ignore the TTL and keep using the last answer you got
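The last workaround can be sketched as a small cache that remembers the most recent successful answer and serves it when the resolver fails. The names `resolve_with_fallback`, `_system_lookup`, and `_last_known` are hypothetical, and the cache here is in-memory only and never expires.

```python
import socket
from typing import Callable, Dict, List

_last_known: Dict[str, List[str]] = {}  # hostname -> last successful answer


def _system_lookup(host: str) -> List[str]:
    # Ask the system resolver and keep just the IP addresses.
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def resolve_with_fallback(
    host: str, lookup: Callable[[str], List[str]] = _system_lookup
) -> List[str]:
    """Resolve host; if the resolver fails, reuse the last good answer."""
    try:
        addresses = lookup(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        if host in _last_known:
            return _last_known[host]  # stale, but better than failing
        raise
    _last_known[host] = addresses
    return addresses
```

The `lookup` parameter exists only so the resolver can be swapped out, for example to simulate an outage.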
DNS Provider is Down - Workarounds
- Maintain DNS records at two hosts
- Reference both in nameserver list
When something is taking too long, you abandon it
Your users have a timeout, whether your system does or not
Outside In
Setting Timeouts - 2 Questions
- What's the maximum acceptable response time?
- How long can my service afford to spend processing requests?
Hard Math Stuff
- Thread pool with 100 workers
- Each request takes 1 second. 20 requests come in per second.
- Downstream server is slow, requests take 6 seconds each
- Set timeout to prevent this
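The arithmetic behind this slide follows from Little's law (average busy workers = arrival rate times time per request), which is named here as an aid and is not mentioned in the slides. A sketch with a hypothetical helper `workers_needed`; it gives averages only and ignores bursts.

```python
def workers_needed(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of busy workers: arrival rate * time per request."""
    return requests_per_second * seconds_per_request


POOL_SIZE = 100
healthy = workers_needed(20, 1)  # 20 busy workers, 80 idle
slow = workers_needed(20, 6)     # 120 needed, more than the pool holds
print(healthy <= POOL_SIZE, slow <= POOL_SIZE)  # True False
```

With these numbers, any downstream timeout under 5 seconds keeps average demand below the 100 available workers, since 20 * 5 = 100.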
Fail early if you can't serve a request
Socket Timeouts Are Liars
Remote Server Unreachable
HAProxy
retries is the number of times a connection attempt should be retried on a server when a connection either is refused or times out. The default value is 3.
Timeouts
One Timeout Value = applied to both Connect and Read
Separate Connect Timeout
Not available in the standard library!
Timeouts - Workarounds
- *Set timeouts*
- Set shorter, separate, connect timeout
- Know how your HTTP client behaves
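Where a client accepts only one timeout value, the two phases can still be bounded separately one level down. A minimal sketch with Python's `socket` module (the slides do not name a language, so Python is an assumption); `open_with_timeouts` and its default values are hypothetical. Note that the connect timeout here does not cover the DNS lookup.

```python
import socket


def open_with_timeouts(
    host: str,
    port: int,
    connect_timeout: float = 1.0,
    read_timeout: float = 10.0,
) -> socket.socket:
    # The timeout passed here bounds each connection attempt.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    # Once connected, switch to a (usually longer) timeout for reads/writes.
    sock.settimeout(read_timeout)
    return sock
```

A short connect timeout makes sense because an unreachable server is usually obvious quickly, while a healthy server may legitimately take longer to produce a response.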
Timeouts - Wall Clock Timeout
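A socket read timeout typically applies to each individual read, so a server that trickles out one byte at a time can keep a request alive far longer than the configured value. One way to enforce a wall-clock limit is to track a deadline and shrink the per-read timeout as time passes. This is a sketch, and `recv_until_deadline` is a hypothetical helper.

```python
import socket
import time


def recv_until_deadline(sock: socket.socket, nbytes: int, deadline_seconds: float) -> bytes:
    """Read up to nbytes, giving up once deadline_seconds of wall-clock time pass."""
    deadline = time.monotonic() + deadline_seconds
    chunks = []
    received = 0
    while received < nbytes:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("wall-clock deadline exceeded")
        sock.settimeout(remaining)  # each recv may only use the time left
        try:
            chunk = sock.recv(nbytes - received)
        except socket.timeout:
            raise TimeoutError("wall-clock deadline exceeded") from None
        if not chunk:  # peer closed the connection
            break
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)
```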
Timeouts - Measure
Retries - Transient Failures
Retries - Single Component Failures
Exponential Backoff
1, 2, 4, 8...
Exponential Backoff with Jitter
1.01, 2.03, 3.9, 8.2...
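Jitter keeps clients that failed at the same moment from retrying in lockstep. A minimal sketch of a delay generator; `backoff_delays` is a hypothetical name, and the 10% jitter range is an assumption chosen to resemble the example values above.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    jitter: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield delays of base * 2**n seconds, each shifted by up to +/- jitter
    (expressed as a fraction of the delay)."""
    rng = rng or random.Random()
    for n in range(attempts):
        delay = base * 2 ** n
        yield delay * (1 + rng.uniform(-jitter, jitter))


for delay in backoff_delays(4):
    print(f"sleep {delay:.2f}s before the next attempt")
```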
Idempotent Actions
- Retrieving a profile
- Setting user's email to foo@bar.com
- Emptying a bank account
Idempotent Requests
You can always retry idempotent actions!
Not Idempotent
- Sending an email
- Charging a credit card
Not Idempotent Requests
- You can retry if the data never made it (connection timeout, connection error)
- Determining whether the data made it is hard
- You can retry if you get a 429 or a 503 (carefully!)
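The retry rules above can be collected into a single decision function. This is a sketch, not a complete policy; `may_retry` and its parameters are hypothetical, and in practice deciding whether the request was sent is the hard part.

```python
from typing import Optional

RETRYABLE_STATUSES = {429, 503}


def may_retry(idempotent: bool, request_was_sent: bool, status: Optional[int] = None) -> bool:
    """Decide whether a failed request may be retried."""
    if idempotent:
        return True  # idempotent actions can always be retried
    if not request_was_sent:
        return True  # connect timeout or connection error: data never made it
    # The data may have arrived; only retry on statuses that usually mean
    # the server turned the request away (and even then, carefully).
    return status in RETRYABLE_STATUSES
```

For example, a card charge that fails with a read timeout after the request was written returns `False`, because the charge may already have gone through.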
How do I make things idempotent?
Idempotent (retryable) request
Requires sid collision handling!
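One common way to make a request retryable is for the client to generate a unique id once and send it with every attempt, so the server can recognize repeats. The sketch below assumes that "sid" refers to such a client-supplied id; `ChargeStore`, `SidCollision`, and the card values are all hypothetical, and the store is in-memory only.

```python
import uuid
from typing import Dict, Tuple


class SidCollision(Exception):
    """The sid was already used for a different request."""


class ChargeStore:
    """Hypothetical server-side store that deduplicates charges by sid."""

    def __init__(self) -> None:
        self._seen: Dict[str, Tuple[str, int]] = {}
        self.charges_made = 0

    def charge(self, sid: str, card: str, cents: int) -> str:
        request = (card, cents)
        if sid in self._seen:
            if self._seen[sid] != request:
                raise SidCollision(sid)  # same sid, different request
            return sid  # a retry of a request already handled: do nothing
        self._seen[sid] = request
        self.charges_made += 1  # stand-in for actually charging the card
        return sid


store = ChargeStore()
sid = uuid.uuid4().hex  # generated once by the client, reused on every retry
store.charge(sid, "card_123", 500)
store.charge(sid, "card_123", 500)  # retry: no second charge
print(store.charges_made)  # 1
```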
Thanks!
Kevin Burke
These slides are available at: