
Ethical Considerations for Software Engineers

The next president of the United States showed a willingness to violate historical norms while campaigning, and there's little evidence that he has any moral compass - the examples are legion; one of the worst is his cutting off medical treatment for his sick nephew over a legal dispute. His kids are going to run his businesses (with his name on them) while he is in office. He has also asked for security clearances for them. This is at best an unusual arrangement and at worst opens the door to massive corruption.

During the election the Russian government hacked and leaked the DNC's emails, then hacked and leaked the emails of Hillary Clinton's campaign chairman. Trump denied Russia's involvement publicly at a debate even though he'd been briefed on it. Trump has taken many sides on many issues, but his praise for Putin and Russia has been consistent. Trump just named a paid Russia Today commentator as his National Security Adviser. It is likely that Russian (and Chinese, Iranian, etc.) hacking of US government offices and US companies will be tolerated over the next four years, especially if it benefits Trump and hurts his political opponents.

It's important to note these attacks won't come out of the blue. It's not sunny one day and the next there are men in suits asking for data center access. There will probably be some pretext - a foreign war, a terror attack, something else - that'll be used to justify the unethical request. It's easy to say now, "Of course I would refuse to do the unethical thing!" It's much harder in the moment, or in a climate of fear.

Note also that if you are an engineer, these requests may come outside of normal channels. Last year, Yahoo fielded a request to search all emails for a given term. Yahoo's C-level executives went around the security team and asked engineers to implement this directly, at an extremely low level. Alex Stamos, Yahoo's CSO, resigned when he found out. You should be prepared to do the same. Don't expect unethical requests to show up on the backlog - it'll be a meeting you're pulled into with the CTO, or a man showing up at your apartment and threatening your immigration status unless you insert a backdoor.

Employees (and especially engineers) will be the key people to push back. Customers aren't always aware of shenanigans, and management can be under more pressure to make their company succeed. Especially in Silicon Valley, most employees have multiple job options, which gives us unique leverage. Every employee at a Silicon Valley company should be prepared for unethical or illegal requests, and (where appropriate) be prepared for state sponsored attacks, from the US government or another one. Every employee should be prepared to put pressure on management, and the legal team, to deny requests.

Here are some examples of ethical problems you might run into. I'd encourage you to have these discussions internally before you get put in the situation discussed below, and lay out bright lines for everyone in the company to follow, to make it clear where you stand and what's not acceptable. I would also encourage you to ask about these when you interview.

All

The pledge at neveragain.tech has covered this in more detail but here are some good questions to ask in an interview:

  • Do you encrypt messages that go from datacenter to datacenter? The NSA has spied on this data in the past.

  • Do you offer end-to-end encryption of messages sent between users?

  • Do you destroy sensitive data if it's not needed anymore? Do you destroy user data if they delete their accounts?

  • What is your policy for responding to requests from the US government and other governments?

  • Do you have data that would be valuable to foreign governments, or embarrassing to customers if it was made public? What's your strategy for protecting that data against sophisticated nation states?

  • Would you take money from the Trump Organization or its affiliates in exchange for an explicit or implicit guarantee of "protection"?

Venture Capitalists / CEO's

  • Donald Trump's children or their representatives may ask for a share in your fund, in exchange for favorable treatment from the federal government. Would you accept such a request? Note they may ask after they have successfully applied this approach to other companies.

  • You may be approached for an investment by a company or entity that has ties to the Russian government, or ties to the Trump Organization. This may be accompanied by a threat of harassment from the federal government, hacking, DDOS, or other. Would you accept the investment?

Slack

  • By default you store a company's entire conversation history, including DM's. Private information like this is easy to distort and take out of context. Russians hacked emails from the DNC and trickled them to the press, with devastating effects. Should the default behavior for a Slack installation be to store a company's entire history?

  • What efforts are you making to educate users about the risks of storing their entire conversation history on Slack? What are the highest-value targets for hackers who'd like to compromise the Slack network?

  • What progress have you made on end-to-end encryption for Slack messages?

  • Is there a way to store the data where a compromise would not allow a hacker to access every message for every company in your system? Say you had three different datastore designs.

Uber/Lyft

  • Your companies store a massive amount of data on where users have been and where they are going. If exposed, this data could be used to embarrass people - why is this married Congressman requesting a ride from outside a gay bar, or a hotel in the middle of the day?

  • What options do users have for removing their trip history from your site?

  • What employees can access user data, and under what circumstances? What tools do you have for anonymizing data that's not viewed in aggregate?

  • Many Trump voters cited a feeling of being left behind as a reason to vote for him. Uber drivers are 1099 contractors, which means you are prohibited from providing them with training. What responsibility do corporations have to put their workers on an upwards career path?

  • Many of your 1099 contractors get health care from the government, or on government-mandated exchanges. These exchanges are being threatened by Republican governors in many states, and Republicans in Congress. What responsibility does Uber have to work for healthcare for its drivers?

  • Your legal page says "We generally require a valid request issued in accordance with applicable law before we can process private requests for information." What does "generally" mean in this context? If China passes a law that says "we can ask for everything," would Uber comply?

  • You've taken money from Saudi Arabia's public investment arm. Would you say no to that money if the Saudi Arabian government asked for data on customers as a condition of the deal?

Stripe/Braintree

  • You collected millions of dollars in revenue from the Trump campaign in 2016. If Trump acts like an authoritarian in office, or severely restricts the rights of minorities or immigrants, will you support his campaign again in 2020?

  • Does Stripe receive requests from law enforcement? What is your policy for responding to subpoenas?

  • If Stripe processes a credit card payment, who can see the record of that transaction? Who should be able to see it, and/or remove it?

Twilio

  • Do you encrypt messages passing from datacenter to datacenter?

Facebook

  • Historically newspapers and other media organizations have had a strong understanding of their role in promoting democracy and enforcing accountability from the government and our business leaders. Facebook has become a very important part of how people figure out what's going on in the world around them. What responsibility does Facebook have to ensure people have a mostly-correct view of the world? Should Facebook have a role in promoting democracy and in rejecting authoritarianism?

  • Facebook tells advertisers that their ads can change users' minds. But Facebook also insists that the algorithms it uses to show information didn't sway the US election (or overseas elections). Which is it?

  • Has Facebook responded to queries from governments on the lines of "Muslims/blacks/immigrants living in state/city/county X"?

  • Facebook's current policy is to censor/restrict content according to local laws. If a law was passed to restrict speech in the United States, would Facebook comply?

  • Does Facebook encrypt data being sent from datacenter to datacenter?

Twitter

  • What line would Donald Trump have to cross for you to suspend or ban his account?

In sum

You are the most likely agent of change at your company. A lot of stuff may happen in the next four years and it's good to think and declare now, when things are relatively sane, what you'll agree to do or not do, because in the aftermath of another 9/11, or similar event, you may be asked to do a lot.

I've laid out my own consulting ethics guide here.


Tradeoffs in Software Provisioning Tools

A while ago my friend Alan and I were discussing configuration management. In particular we wondered why every configuration management tool has to ship a DSL, or be loaded from YAML files.

We wondered if it would be possible to just write code that deploys servers — it might let you describe what you want to do much more precisely.

I started working on a library that lets you do this. Basically, turn every module + state combination in Ansible into a function, add any required arguments as part of the function signature, and add an Opts struct for any optional arguments. Right now it looks something like this.

if err := core.AddGroup(ctx, host, "wheel", core.GroupOpts{
    System: false,
    Gid: "1001",
}); err != nil {
    log.Fatal(err)
}

But starting to implement this led to several more non-obvious tradeoffs.

Abstraction

This is the most obvious reason to use a configuration management tool. Whether you are deploying OpenBSD or Ubuntu or Darwin, you still need to create users and create folders and install packages. A good provisioning tool will abstract these for you, and choose sensible defaults.

However abstractions can be leaky; maybe one filesystem offers a feature that others don't, and it can be hard to make this available while also saying "this is only supported on these systems."

Run Commands On Local Machine vs. Remote Machine

Do you want to run commands on the machine that triggered the provisioning process, or the machine being provisioned? Take mysql for example. If you have a mysql client on the local machine, you can issue commands to the remote machine using the mysql protocol on the wire.

This requires the remote machine to expose port 3306, which you might not want to do. You also need to trust mysql's ability to encrypt a connection and trust a remote certificate. Compared with SSH, mysql has had much less auditing of its security code, and is not as good of a bet for encrypting/safely compressing content going over the wire. (This becomes more salient when you have N protocols for issuing commands over the wire, instead of just SSH.)

Another option would be to SSH to the remote machine, then run a mysql client on that machine to configure/provision MySQL. But this requires that you have a MySQL client on the remote machine. It's also considerably trickier to issue multiple MySQL commands over a single SSH session, and take action based on the results.

Run Multiple Commands Per SSH Connection

A single task in Ansible for "create this recursive directory" embeds a ton of complexity. If the directory exists, but has the wrong permissions, or the wrong owner, the directories are recursively created and chowned/chmodded. You can do this with SSH commands, e.g. ssh host "mkdir -p foo && chmod 755 foo && chown -R root:wheel foo", but it gets more and more complicated, and tougher to determine which command failed, the more commands you layer on.

You can work around this by issuing each command over its own SSH connection, getting the result, and making some decision based on it. But this vastly increases the latency of what you're trying to do - even with pipelining enabled, you're looking at about one second of latency per operation.

Ansible works around this by copying a Python file to the remote machine, then running that file on the remote machine with several arguments. This occurs for each directive that Ansible runs. This has two implications: each machine needs to have Python on it, and Ansible is really slow - think one second per directive you put in Ansible.

With Go, we could copy a binary to the remote host and then run it. This would let us take advantage of Go's standard libraries (instead of issuing Unix commands directly via SSH). We could either compile+SCP this binary on the fly, or ship precompiled binaries for each architecture as part of the distribution.
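
For example, once code is running on the host being provisioned, the "recursive directory" task from above becomes a few standard library calls, each with its own error (a sketch; the path and the ownership are placeholders, not part of the library):

import "os"

// Equivalent of: mkdir -p /opt/foo && chmod 755 /opt/foo && chown 0:0 /opt/foo,
// but run locally on the target host, with a distinct error for each step.
func makeAppDir() error {
    if err := os.MkdirAll("/opt/foo", 0755); err != nil {
        return err
    }
    // MkdirAll's mode is filtered through the umask, so set it explicitly.
    if err := os.Chmod("/opt/foo", 0755); err != nil {
        return err
    }
    return os.Chown("/opt/foo", 0, 0) // uid 0, gid 0
}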

But if we are going to go to that length, why not just add tools to compile the user's entire program, SCP that to the remote filesystem, and run it there?

Run Multiple Directives Per SSH Connection

The only way you are going to get really fast execution is by executing multiple directives/tasks/modules as part of a single SSH connection to the host. The way to achieve the most benefits would be to compile the user's entire configuration program, SCP the binary to the host, then run the binary on the host.
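
A rough sketch of that flow, shelling out to go build, scp and ssh (the paths and host are hypothetical, and a real tool would need to handle target architectures, cleanup, and output streaming):

import (
    "context"
    "os"
    "os/exec"
)

// deploy cross-compiles the user's provisioning program, copies it to the
// host, and runs it there, so every directive executes in one SSH session.
func deploy(ctx context.Context, host string) error {
    build := exec.CommandContext(ctx, "go", "build", "-o", "/tmp/provision", ".")
    build.Env = append(os.Environ(), "GOOS=linux", "GOARCH=amd64")
    if err := build.Run(); err != nil {
        return err
    }
    if err := exec.CommandContext(ctx, "scp", "/tmp/provision", host+":/tmp/provision").Run(); err != nil {
        return err
    }
    return exec.CommandContext(ctx, "ssh", host, "/tmp/provision").Run()
}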

But this requires giving up some flexibility as well. Some Ansible tasks involve copying data from the remote machine to a local machine - for example, mysql_db in target mode. You can do this over the SSH connection, but it might be tricky to separate output that's part of control flow - e.g. "RUN: add group wheel" - from output that's supposed to be copied to the local machine — e.g. a mysql dump. Alternatively, if you need to copy a file from the local machine to the remote machine, you need to bundle that file as part of the target you SCP to the remote machine.

For Go specifically, you'd either need the Go toolchain on the remote machine, and to copy over all of the source files, or you'd need to cross compile the source on the local machine, which means things like user.Current() won't work.

Conclusion

There are a few thorny problems that weren't immediately apparent when I started working on this. For the moment I'm going to try to proceed with the Go solution, porting over an existing set of Ansible directives, and I'm going to try to prioritize speed of remote execution.

But I'm much less confident that it's going to work well without a lot of effort.


An API Client that’s Faster than the API

For the past few weeks I've been working on Logrole, a Twilio log viewer. If you have to browse through your Twilio logs, I think this is the way that you should do it. We were able to do some things around performance and resource conservation that have been difficult to accomplish with today's popular web server technologies.

Picture of Logrole

Fast List Responses

Twilio's API frequently takes over 1 second to return a page of calls or messages via the API. But Logrole usually returns results in under 100ms. How? Every 30 seconds we fetch the most recent page of Calls/Messages/Conferences and store them in a cache. When we download a page of resources, we get the URL to the next page - Twilio's next_page_uri — immediately, but a user might not browse to the next page for another few seconds. We don't have to wait for you to hit Next to get the results - on the server side, we fetch/cache the next page immediately, so it's ready when you finally hit the Next button, and it feels really snappy.
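
The prefetching logic is conceptually simple - something like the sketch below, where Page, the fetch callback, and the plain map standing in for the LRU cache are all simplified stand-ins for Logrole's real types:

import (
    "context"
    "sync"
)

type Page struct {
    NextPageURI string
    // ... the resources on this page ...
}

var (
    mu    sync.Mutex
    cache = map[string]*Page{} // stand-in for the real LRU cache
)

// getPage serves a page from the cache when possible; after a cache miss it
// kicks off a background fetch of the following page, so a click on "Next"
// can be answered immediately.
func getPage(ctx context.Context, uri string, fetch func(context.Context, string) (*Page, error)) (*Page, error) {
    mu.Lock()
    p, ok := cache[uri]
    mu.Unlock()
    if ok {
        return p, nil
    }
    p, err := fetch(ctx, uri)
    if err != nil {
        return nil, err
    }
    mu.Lock()
    cache[uri] = p
    mu.Unlock()
    if next := p.NextPageURI; next != "" {
        go func() {
            // Warm the cache before the user hits the Next button.
            if np, err := fetch(context.Background(), next); err == nil {
                mu.Lock()
                cache[next] = np
                mu.Unlock()
            }
        }()
    }
    return p, nil
}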

The cache is a simple LRU cache. We run Go structs through encoding/gob and then gzip before storing them, which takes the size of a cache value from 42KB to about 3KB. At this size, about 8,300 values can fit in 25MB of memory.

var buf bytes.Buffer
writer := gzip.NewWriter(&buf)
enc := gob.NewEncoder(writer)
if err := enc.Encode(data); err != nil {
	panic(err)
}
if err := writer.Close(); err != nil {
	panic(err)
}
c.mu.Lock()
defer c.mu.Unlock()
c.c.Add(key, buf.Bytes())
c.Debug("stored data in cache", "key", key, "size", buf.Len(),
    "cache_size", c.c.Len())

Right now one machine is more than enough to serve the website, but if we ever needed multiple machines, we could use a tool like groupcache to share/lookup cached values across multiple different machines.

Combining Shared Queries

The logic in the previous paragraphs leads to a race. If the user requests the Next page before we've finished retrieving/saving the value in the cache, we'll end up making two requests for the exact same data. This isn't too bad in our case, but doubles the load on Twilio, and means the second request could have gotten the results sooner by reusing the response from the first request.

The singleflight package is useful for ensuring only one request ever gets made at a time. With singleflight, if a request is already in progress with a given key, a caller will get the return value from the first request. We use the next page URI as the key.

var g singleflight.Group
g.Do(page.NextPageURI, func() (interface{}, error) {
    // 1) return value from cache, if we've stored it
    // 2) else, retrieve the resource from the API
    // 3) store response in the cache
    // 4) return response
})

This technique is also useful for avoiding thundering herd problems.

Canceling Unnecessary Queries

You've configured a 30 second request timeout in nginx, and a query is taking too long, exceeding that timeout. nginx returns a 504 Gateway Timeout and moves on, but your server is still happily processing the request, even though no one is listening. It's a hard problem to solve because it's much easier for a thread to just give up than to give up and tell everyone downstream of you that they can give up too. A lot of our tools and libraries don't have the ability to do that kind of out of band signaling to a downstream process.

Go's context.Context is designed as an antidote for this. We set a timeout in a request handler very early on in the request lifecycle:

ctx, cancel := context.WithTimeout(req.Context(), timeout)
defer cancel()
req = req.WithContext(ctx)
h.ServeHTTP(w, req)

We pass this context to any method call that does I/O - a database query, an API client request (in our case), or an exec.Command. If the timeout is exceeded, we'll get a message on the channel at ctx.Done(), and can immediately stop work, no matter where we are. Stopping work when a context is canceled is a built-in property of http.Request and os/exec in Go 1.7, and will be in database/sql starting with Go 1.8.
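
For example, an outbound API request made with the request's context is abandoned the moment the deadline passes, wherever it happens to be (the URL here is just a placeholder):

import (
    "context"
    "net/http"
)

func fetchAccounts(ctx context.Context) (*http.Response, error) {
    req, err := http.NewRequest("GET", "https://api.twilio.com/2010-04-01/Accounts.json", nil)
    if err != nil {
        return nil, err
    }
    // If ctx is canceled or its deadline passes, the transport gives up
    // immediately and Do returns an error.
    return http.DefaultClient.Do(req.WithContext(ctx))
}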

This is so nice - as a comparison, one of the most popular npm libraries for "stop after a timeout" is the connect-timeout library, which lets you execute a callback after a particular amount of time, but does nothing to cancel any in-progress work. No popular ORM's for Node support canceling database queries.

It can be really tricky to enforce an absolute deadline on a HTTP request. In most languages you compute a timeout as a duration, but this timeout might reset to 0 every time a byte is received on the socket, making it difficult to enforce that the request doesn't exceed some wall-clock amount of time. Normally you have to do this by starting a 2nd thread that sleeps for a wall-clock amount of time, then checks whether the HTTP request is still in progress and kills the HTTP request thread if so. This 2nd thread also has to cleanup and close any open file descriptors.

Starting threads / finding and closing FD's may not be easy in your language but Contexts make it super easy to set a deadline for sending/receiving data and communicating that to a lot of different callers. Then the http request code can clean up the same way it would for any canceled request.

Metrics

I've been obsessed with performance for a long time and one of the first things I like to do in a new codebase is start printing out timestamps. How long did tests take? How long did it take to start the HTTP server? It's impossible to optimize something if you're not measuring it and it's hard to make people aware of a problem if you're not printing durations for common tasks.

Logrole prints three numbers in the footer of every response: the server response time, the template render time, and the Twilio API request time. You can use these numbers to get a picture of where the server was spending its time, and whether template rendering is taking too long. I use custom template functions to implement this - we store the request's start time in its Context, and then print time elapsed on screen. Obviously this is not a perfect measure since we can't see the time after the template footer is rendered - mainly the ResponseWriter.Write call. But it's close enough.

Page Footer
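
The template function itself is small - roughly the sketch below, where an earlier middleware has stored the start time in the Context (the names here are illustrative, not Logrole's actual ones):

import (
    "context"
    "html/template"
    "time"
)

type ctxKey int

const startKey ctxKey = 0

// In a middleware: r = r.WithContext(context.WithValue(r.Context(), startKey, time.Now()))

var footerFuncs = template.FuncMap{
    // {{ render_time .Ctx }} prints how long the request has taken so far.
    "render_time": func(ctx context.Context) string {
        start, ok := ctx.Value(startKey).(time.Time)
        if !ok {
            return ""
        }
        return time.Since(start).String()
    },
}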

HTML5

Logrole loads one CSS file and one font. I would have had to use a lot more Javascript a few years ago, but HTML5 has some really nice features that eliminate the need for Javascript. For example, HTML5 has a built-in date picker that people can use to filter calls/messages by date (it's best supported in Chrome at the moment). Similarly, you don't need Javascript to play recordings anymore; HTML5 has an <audio> element that will provide controls for you.

I've needed Javascript in only three places so far:

  • a "click to show images" toggle button where the CSS necessary to implement it would have been too convoluted
  • a "click-to-copy" button
  • To submit a user's timezone change when they change it in the menu bar (instead of having a separate "Submit" button).

About 50 lines in total, implemented in the HTML below the elements where it's needed.

Conclusion

Combining these techniques, we get a server that uses little memory, doesn't waste any time doing unnecessary work, and responds and renders a slow data set extraordinarily quickly. Before starting a new project, evaluate the feature set of the language/frameworks you plan to use - whether the ORM/API clients you are planning to use support fast cancelation, whether you can set wall-clock timeouts and propagate them easily through your stack, and whether your language makes it easy to combine duplicate requests.

If you are a Twilio customer I hope you will give Logrole a try - I think you will like it a lot.

Thanks to Bradley Falzon, Kyle Conroy and Alan Shreve for reading drafts of this post. Thanks to Brad Fitzpatrick for designing and implementing most of the code mentioned here.


Election Guide (Part 2) – CA Ballot Propositions, State Senate, more

This is Part 2 of my voter guide. Part 1 covers the 24 San Francisco ballot propositions and city supervisor races.

The deadline to register to vote in California is October 24. I highly recommend you sign up. Click here to register to vote.

A few notes I cover in more detail in Part 1: More housing is the most important issue for me on this year's ballot, and by default I vote "no" on ballot propositions, since I think we shouldn't be deciding policy by statewide or citywide ballot.

California State Initiatives

Prop 51 (School Bonds): Yes

The real story here is that Proposition 13, passed decades ago, limits the state's ability to collect property taxes, enriching a generation of homeowners at everyone else's expense. This is why our schools constantly need more money.

I also wish the Legislature could figure out its budget and priorities on its own, so we didn't have to vote on things like this. I don't feel too strongly in either direction.

Prop 52 (Medi-cal): No

Hospitals pay a required fee to the CA State government (about $5 billion a year). When the State allocates this money for Medi-cal, the federal government provides about $4 billion in matching funds.

In the past the State has diverted some of the hospital fee money to the general fund, which hurts twice over: not only does Medi-cal miss out on the fee money, it also misses out on the federal matching funds.

This measure would require the hospital fee money to be spent on Medi-cal, which seems reasonable.

I'm upset that we have to vote on this; I would rather the legislature do the right thing. I'm also upset that this amends the state Constitution; I don't think the Constitution should get into the specifics of how things should be funded. I also think we should be trying to untie the hands of our legislators, not restrict them further, and that they're as aware of the cost of giving up matching funds as voters are.

Prop 53 (Voter Approval for Megaprojects): No

I'm really torn on this. On the one hand, you are putting voters in charge of deciding even more things about what the government does. On the other, megaprojects frequently fail and the majority come in at least 50% over budget (high speed rail is only the most prominent example of this). Politicians also like to build big things so they can have a "legacy" and the history of big things lately has been really mixed - see high speed rail and also the Bay Bridge which has required frequent fixes and may be cracking.

Our politicians might not make great decisions with our money but I think voters would make worse decisions. Note voters approved the first $9 billion of a high speed rail project whose final cost may be upwards of $60 billion, when no real funding source for the other $51 billion was in sight. This would also increase uncertainty and delay the start of any project until the next statewide vote.

Prop 54 (72 Hr Bill Freeze): Yes

I am unhappy that this measure amends the Constitution. But apparently there are numerous instances of state legislators shoehorning special-interest-friendly language into bills at the last second.

There was that budget measure that limited the amount of reserves local school districts could maintain as a cushion against lean times (a gift to the teachers union, which wanted to make those dollars available for immediate spending); the 2009 waiver of environmental rules for a downtown Los Angeles football stadium (on the argument that time was of the essence to secure an NFL team ... the project never broke ground); or the 2011 bill that Democrats rushed through to force all voter initiatives on the November ballot, thus breaking a deal with Republicans to put spending reform on the June 2012 ballot.

Prop 55 (Extending Income Taxes on High Earners): No

The share of tax each resident pays is something that the Legislature should resolve. I also agree with the Chronicle that this measure will increase the variability of revenue in the state budget, which isn't great.

Prop 56 ($2 Cigarette Tax): Yes

In general taxes are a good way to discourage behavior you don't want. Cigarettes are unhealthy and incur significant spillover costs due to secondhand smoke, and the additional burden on the healthcare system from insuring/treating patients with cancer and emphysema.

I would have preferred for the Legislature to vote for this tax as well.

Prop 57 (Parole): No

Many people are serving sentences that are too long and the prisons are overcrowded. But the language is confusing and I don't see why the Legislature can't pass legislation to deal with this issue.

Prop 58 (Local Language Education Flexibility): Yes

Apparently this is on the ballot because it repeals a previous voter-passed initiative from 1998. The worry is that voting Yes will allow students to graduate without mastering English at all, which isn't good. But it seems like all of parents, students and schools want students to learn English, they just don't agree that "all English classes, all the time" is the best way to do it.

Prop 59 (Citizens United): No

I'm voting No because this is a waste of energy and we shouldn't be voting on things like this, not based on any opinion about Citizens United.

Prop 60 (Porn Stars Wear Condoms): No

The practical effect of this measure would be to shift the porn industry from California to Nevada or another nearby state. The porn industry already requires performers to get tested every two weeks. There are problems that probably deserve more scrutiny - the exploitation of performers in some scenarios - but it's not clear that this initiative is the vehicle or the method to fix them.

Prop 61 (Drug Prices): No

I agree with the Chronicle that the right solution here is to make drug prices (and the rates each agency pays) public, instead of ensuring that the prices Medi-cal and the VA pay are the same. I also think there are legitimate concerns about reduced access to necessary drugs and the ability of the Legislature to override this initiative if there are unforeseen problems.

Prop 62 (Repeal Death Penalty): Yes

Prop 66 (Quicken Death Penalty): No

Leaving aside whether it is ethical to put someone to death for crimes they have committed, I am against the death penalty for the following, more practical reasons:

  • It's entirely possible we have put an innocent person to death, a monstrous miscarriage of justice that should never be allowed to happen.

  • It's argued that the death penalty deters people from violent crimes. But there's a lot of evidence that deterrence depends much more on the severity and the certainty of punishment. Death, if it comes at all for death row inmates, is applied years or decades after the fact.

  • There are legitimate concerns about whether execution can be done "humanely" and a number of states have had problems sourcing the drugs used to put people to death.

  • It's expensive to execute someone, both in pure cost and in the cost of the appeals process - a death sentence must be appealed to the Supreme Court.

Repeal would also save California a significant amount of money.

Prop 63 (Ammunition): No

The biggest effect of a Yes vote would be to add additional charges for people who would like to buy ammunition. I don't think we need to vote on this.

Prop 64 (Marijuana Legalization): No Position

In general I'd prefer for drugs to be legalized and heavily regulated + taxed, instead of illegal, especially when you consider the potential revenue. I also think criminal sentences for possessing or distributing marijuana should be smaller than they are (the initiative provides for this). However, I'm concerned that marijuana is only as expensive as it is because it is illegal. Marijuana is not an expensive crop, and if it becomes legal to grow the price per ounce could go really low. I'm worried the flat taxes per ounce are too low, and the 15% sales tax should be a flat tax or a guaranteed minimum price per ounce.

The results on public health so far are mixed; one study reports a 7% increase in traffic fatalities for every 1% increase in marijuana consumption. The penalties for drunk drivers are not currently high enough and I'm worried we don't know how to measure whether a driver is high.

On top of this I am worried that the Legislature won't have the flexibility to override a state initiative; any amendments require a 2/3 vote.

Prop 65 (Money from Paper Bags to Environment): No

This directs revenue from grocery bag fees to specific environmental causes. I don't think we should put additional constraints on where the Legislature should direct money, and I don't think we should pass things by state initiative.

Prop 66: No (see #62 above)

Prop 67 (Affirm Plastic Bag Ban): Yes

Proposition 67 is a referendum on the existing bag law (10 cents a bag); a "Yes" vote says "Yes, please keep the law the way it is." I prefer the Legislature to write laws, not California voters, so I am voting Yes.

Superior Court

Victor Hwang, who has experience working as a public defender.

Board of Education

Stevon Cook, Matt Haney, Rachel Norton, Jill Wynns.

Community College Board

Rafael Mandelman, Amy Bacharach, Alex Randolph, Shanell Williams.

Bart Director

Gwyneth Borden, who has been endorsed by the Chronicle and is open to a ban on BART strikes.

California State Senate: Scott Wiener

This is one of the most important races on the ballot due to the difference in quality between the candidates. Wiener is running against Jane Kim, who has opposed numerous housing projects, and is sponsoring some of the poorer propositions on the city ballot. Scott Wiener understands how to build more housing in San Francisco.

Kim also recently sponsored "legacy status" for Luxor Cab Company, which gives them a permanent subsidy from the City of San Francisco. This is a terrific waste of money compounded by the fact the benefit won't do anything for the company's cab drivers, only its 20 or so full time employees. Vote for Scott Wiener.

California State Representative: David Chiu

Chiu is running against Matthew Del Carlo, who does not appear to have policy positions listed anywhere publicly; it's not clear what he's running on, or what he would do in office. I reached out to him multiple times asking him to post his positions publicly and he's refused to do so.

Chiu slammed Governor Brown for including $0 in affordable housing in this year's budget. The housing measures were tied to the Governor's "by right" housing legislation, which would have done more to lower rent/housing prices in San Francisco than any other legislative measure in a decade. It's not clear whether Chiu supported or opposed this measure.


United States Senate: Kamala Harris

Harris is running for Barbara Boxer's old seat. We really need a California Senator who understands the technology industry and is willing to fight for it; who understands you can't just make a "golden key" to read messages that only the US government can access, as in Dianne Feinstein's horrible encryption bill.

President

Hillary Clinton.


San Francisco Voting Guide – Propositions and Supervisors

I think this is useful and the ballot's complicated so I wanted to share how I'm voting this year. I used several sources to compile this guide:

  • The SF Chronicle's endorsements - they follow these issues every day.

  • The ballot book mailed to every voter, especially the text of the law and the main pro/con arguments.

I highly recommend voting by mail. You can feel too rushed or disorganized in the ballot booth, especially in this election, when there are so many things to vote on.

San Francisco Ballot Initiatives

The #1 issue for me in this election is housing. People make a fundamental mistake when analyzing the SF housing market; they see lots of increased demand (maybe 10%) and little increased supply (maybe 1%), and conclude "We're building housing but prices are still rising; the new housing must be causing the price increases." In reality if demand is outpacing supply you'd expect to see prices rise and supply rise, and the new housing stock is preventing the price from rising even faster than it currently is!

I also see a lot of hypocrisy. SF is full of liberals, and social mobility is a traditional liberal plank. In one of the hottest economies in the country, high rents are preventing poor people from moving here and establishing a foothold. Lowering the price of housing in our fastest growing economies is a moral imperative.

San Francisco added 5000 new units this year, and SF condos are 8% cheaper this year than last year. The market rate of rent also slowed from its normal double-digit increase. We need to build on this progress.

I want there to be more housing in San Francisco, of all shapes and sizes. In this election, anything that makes housing more complicated to build is a No; anything that makes housing easier is a Yes. Affordable housing is admirable but isn't a full answer, and gets more expensive as market rent rises. The easiest path to more affordable housing is to lower market rent.

I'll say two other things; in general I am opposed to deciding things by ballot initiative that could be resolved by the Board of Supervisors or the State Senate, since election votes tend to tie the hands of our elected officials, and can require supermajorities to unwind. So all other things being equal I am more likely to vote No on any given ballot measure.

I am also generally opposed to measures that set aside percentages of the budget, or specific dollar amounts, for any cause, no matter how noble. They reduce the flexibility of our elected officials to balance a budget, which is why we elect them in the first place. The percentage of the city budget each interest group would like to reserve for itself would well exceed 100%.

Measure A (School Bond): Yes

Measure B ($99 Parcel Tax for City College): Yes

Measure C (Repurpose Earthquake Bonds for Housing): Yes

The City is sitting on $261 million in unspent earthquake safety bonds and would like to redirect it to housing. This will increase the supply of housing.

Measure D (Short Term Appointment Rules): No

Some replacement public officials are named by the mayor to replace someone else who left their term. This measure would prevent them from running for a full term. I see no reason why appointees should not be allowed to run for a full term. The SF Chronicle opposes this measure.

Measure E (Trees Fund): No

$19 million per year for trees. In the words of the SF Chronicle, "San Francisco is running a near $10 billion budget. The civic bill for tree care is pegged at $20 million. There should be room for this expense without carving out a program that can't be changed."

Measure F (Youth Vote): No

This would let 16 and 17 year olds vote in local elections.

Measure G (Police Oversight): Yes

This would grant additional powers to a citizen review board. I think police organizations have trouble regulating themselves and this is a good step in the right direction.

Measure H (Public Advocate): No

This creates a new elected position with no power to do anything. "It's posturing minus responsibility, a dream job in the political world," according to the Chronicle.

Measure I (Senior Citizen Fund): No

This measure would set aside $38 million a year for programs for senior citizens and adults with disabilities. I support programs for senior citizens, but would rather our elected officials make decisions about the budget, instead of voters.

Measure J (Homeless Housing and Services): No Position

Measure K (Sales Tax Increase): Yes

In general I'd like to see more parcel tax increases and fewer sales tax increases, since the former hit property owners, who have been granted great gifts by Prop 13. Parcel taxes are also politically unpalatable, though.

Measure L (Muni Board): No

This would let the Board of Supervisors appoint three of the seven members of the Muni Agency. I don't see why the mayor shouldn't appoint members of the Muni Agency.

Measure M (New Housing Committee): No

This would add another layer of approval in the housing approval process, which would make it more difficult to add housing. I am against measures that would make it more difficult to add housing.

Measure N (Noncitizen Resident Voting): No

I'm sympathetic but this would likely be subject to a legal challenge.

Measure O (Office Exemptions): Yes

The city limits new office construction to 950,000 square feet. This is a silly rule, which makes it hard for startups, among others, to rent in San Francisco. This would exempt Candlestick Point development from that square footage rule.

I would like to see similar rules applied to speed housing growth, but there you go.

Measure P (Competitive Bidding for Affordable Housing): No

This makes it more difficult to build housing by discouraging projects that can't get at least three bids. From the Chronicle:

Prop. P obliges the city to seek three bids when offering city land to affordable housing builders. But City Hall already beats the bushes for multiple contenders. By one count, the last 10 projects had at least two bidders. Locking in a three-bid minimum could kill projects which don’t attract that threshold number of entrants. The measure has the potential to stop promising deals, the last thing San Francisco needs.

Measure Q (Prohibit Tent Placement): No

This wouldn't have much practical effect, and won't really help much to address the shortage of beds.

Measure R (Neighborhood Crime Unit): No

This would allocate 3% of the police force for neighborhood crime. Even if this is an issue that could be addressed by this allocation, I don't think the right answer is for the voters to make allocation decisions for the police department.

Measure S (Hotel Money Allocation): No

This would allocate the 8% hotel tax for the arts and for the homeless. In general I'm against allocating revenue for specific purposes; this isn't an exception. I doubt this will matter; the Chronicle has no position and there are no arguments against the measure in the ballot book.

Measure T (Lobbying Rules): Yes

This would add tighter restrictions on what lobbyists are allowed to do and spend to influence votes.

Measure U (Median Income): No

This would help middle income families qualify for affordable housing at the expense of lower income families. Per the Chronicle, "The guidelines for competitive bidding and income qualifications are better left to a process of legislative hearings, study and political compromise that balances the competing goals and concerns. These are not issues to be settled at the ballot box."

The solution here is more housing of all stripes, and hopefully market rate housing that is affordable to middle income families. This wouldn't help.

Measure V (Soda Tax): Yes

Charging a higher price for something is a good way to discourage people from getting it. This strategy has been used very successfully with cigarettes, which cause cancer in others via secondhand smoke; raising the price of cigarettes makes it an expensive habit. The fact that this raises money for the City is an ancillary benefit. The goal of this bill is to make sugary drinks more expensive and non-sugary drinks cheaper by comparison.

I'm also dismayed by the efforts of the measure's opponents, who have sent volumes of mail and misled voters by calling this a "grocery tax." It's a 1 cent per ounce tax on sugary drinks.

Measure W (Higher City Transfer Taxes): No Opinion

The arguments for this measure all seem to say "this will help make City College free", which is very odd since it seems the tax money will go into the General Fund.

The arguments against point out that this also applies to rent controlled buildings and large buildings.

Measure X (Arts Use in New Buildings): No

This would add restrictions if you want to build housing in an area that was formerly used for the arts or certain types of small businesses. We shouldn't be voting on this, and it makes it more difficult to build housing, maybe more so than any other measure on the ballot.

San Francisco Board of Supervisors

District 1: Marjan Philhour

Marjan wants to build more housing of all shapes and sizes to address the area's housing crisis. She's also been endorsed by the Chronicle.

District 3: No Recommendation

Aaron Peskin is the incumbent who is going to win going away. Peskin has held up new housing on several occasions. He's also supported symbolic efforts to oppose Governor Brown's by right legislation, which would have done more for housing growth than any other proposal in a long time. Peskin also believes that you should only be allowed to exceed existing density limits if you build 100% affordable housing, which is a great way to grandstand for affordable housing while ensuring no new housing gets built.

He is being opposed by Tim Donnelly, who supports "respecting building limits", increasing parking, expanding rent control, and "giving residents a voice" because changes have been made "despite overwhelming opposition from the local community." It does not sound like Mr Donnelly is in favor of more housing.

District 5: No Recommendation

London Breed voted against Governor Brown's by right legislation, which would have helped increase the market-rate and affordable housing stock in San Francisco by letting developers build any project that followed local zoning rules and had 20% affordable housing. She also supports affordability requirements that make it difficult to build more housing.

She is being opposed by Dean Preston, who is running against "rent gouging", and supports an "anti-demolition" ordinance for "historic" buildings. Mr. Preston would not make it easier to build more housing in San Francisco.

District 7: Joel Engardio

Engardio is running against Norman Yee, who supports CEQA, a law that is frequently abused to oppose housing. Engardio supports building more housing. "I also know that building more housing will help middle income residents become homeowners -- and we want to keep families from leaving San Francisco. Restricting supply only drives prices higher," he writes.

District 9: No Endorsement

Hillary Ronen pledges to "fight for an affordable San Francisco" and wants to build 5000 units of affordable housing in 10 years. There was a very easy way to have accomplished 5000 units of affordable housing - support Governor Brown's by right housing legislation, which would have guaranteed that 20% of every new building in San Francisco would have been affordable. Her boss, David Campos, voted against it. She also wants to leverage state and federal funds to build affordable housing. Her boss's vote against by right legislation helped remove $400 million for affordable housing from the state budget.

District 11: No Endorsement

None of the candidates in either of these districts seem to agree that building more housing of any shape and size is the best way to alleviate our affordability crisis for everyone. Notably bad is District 11's Kim Alvarenga, running on a platform of "more parking" and "100% affordable housing", which is very difficult to build.

Coming Soon!

California State Propositions, BART director, judicial elections, State Senate and US Senate.


Dumb Tricks to Save Database Space

I have seen a few databases recently that could have saved a lot of space by being more efficient with how they stored data. Sometimes this isn't a big problem, when a table is not going to grow particularly quickly. But it can become a big problem and you can be leaving a lot of disk savings on the table.

Let's review some of the benefits of smaller tables:

  • Indexes are smaller. This means your database needs less space to index your tables, and more RAM can be used to cache results.

  • The cache can hold more objects, since the objects are smaller.

  • You'll delay the point at which your database won't fit on a single disk, and you have to shard.

  • Query results which might have fit in 2 TCP packets will now fit in one.

  • Backups complete more quickly.

  • Your application servers will use less RAM to hold the result.

  • Migrations complete more quickly.

  • Full table searches complete more quickly.

Let's review some common data types and strategies for holding these. If these are obvious to you - great! You can stop reading at any point. They're not obvious to a lot of people.

A brief reminder before we get started - a bit is a single 0 or 1, and a byte is a series of 8 bits. Every ASCII character can be represented in a single byte.

UUID's

It's common to store UUID's as text fields in your database. A typical UUID representation - "ad91e02b-a147-4c47-aa8c-1f3c2240c0df" - will take up 36 bytes, and more if you store it with a prefix like SM or user_. A UUID uses only 16 different characters (the hyphens are for display only, like hyphens in a phone number). This means you only need 4 bits to store a UUID character. There are 32 characters in a UUID, so you can fit a UUID in 16 bytes, a saving of 55%. If you're using Postgres, you can use the uuid data type to store UUID's, or the bytea data type - in MySQL, you can use a binary(16).

CREATE TABLE users (id uuid PRIMARY KEY);

It's often useful to store a prefix with a UUID, so you know what the ID represents from looking at it - for example, SM123 or user_456. I wrote a short library that stores a prefix with a UUID, but strips it before writing to the database. To read UUID's out of the database with a prefix, attach them to the SELECT statement:

SELECT 'user_' || id FROM users LIMIT 5;

My old team at Shyp recently converted text ID's to UUID's and wrote about that process on their engineering blog.

UUID's in JSON

It's becoming more common to store relational data in JSON or JSONB columns. There are a lot of reasons to do this or not do this - I don't want to rehash that discussion here. JSONB does lead to inefficient data storage for UUID's, however, since you are limited to storing characters that are valid JSON. If you are storing UUID's this means you can't get down to 16 bytes, since you can't store just any byte in JSON. You can, however, base64 encode your 16 byte UUID. In Go, that encoding dance looks something like this:

import "encoding/base64"
import "encoding/hex"
import "strings"
rawUUID := "ad91e02b-a147-4c47-aa8c-1f3c2240c0df"
// Strip the hyphens
uuidStr := strings.Replace(rawUUID, "-", "", 4)
// Decode the hex string into a slice of 16 bytes.
bits, _ := hex.DecodeString(uuidStr)
// Re-encode that 16-byte slice using base64.
fmt.Println(base64.RawURLEncoding.EncodeToString(bits))

That outputs rZHgK6FHTEeqjB88IkDA3w, which is only 22 bytes, a 38% improvement.

Use smaller numbers

A default Postgres integer is 4 bytes and can hold any number from -2147483648 to 2147483647. If you know that the integer you are storing is never going to exceed 32,767, you can use a smallint (2 bytes) to store it and save 2 bytes per row.

Use an enum type

Let's say you have a subscription that can have one of several states (trial, paid, expired). Storing the strings "trial", "paid", "expired" in the database can take up extra space. Instead use an enum type, which is only 4 bytes (1 byte in MySQL) and ensures you can't accidentally write a bad status like "trail". Another alternative is to store a smallint and convert them to values that make sense in the application, but this makes it harder to determine what things are if you're querying the database directly, and doesn't prevent mistakes.

Use binary for password hashes

Most password hashing algorithms should give you back raw bytes. You should be able to store the raw bytes directly in the database using bytea.

Move fields out of JSON columns

One downside of JSON/JSONB is that the key gets stored alongside the value for each row in the database. If you are storing a boolean like {"show_blue_button": true} in JSON, you're using 18 bytes per row to store the string "show_blue_button" and only one bit to store the boolean true. If you store this field in a Postgres column, you are only using one or two bits per row. Moving this to a column pays off in terms of space even if you only need the show_blue_button boolean once every 70-140 rows. It's much easier to add indexes on columns than JSON fields as well.

Conclusion

That's it! A small amount of thought and work upfront can pay big dividends down the line. Migrating columns after they're already in place can be a pain. In general, the best approach is to do the following:

  • Add the new column with the correct type.

  • Edit your application to write/update both the new and the old column.

  • Backfill the new column, copying over all values from the old column for old records in the database. If the table is large, do this in batches of 1000 rows or so to avoid locking your table for too long.

  • Edit your application to read exclusively from the new column.

  • Drop the old column.

I hope this helps!

Inspired by some tweets from Andrey Petrov and a Heap Analytics post about JSONB.


More Comment-Preserving Configuration Parsers

For the past few weeks I've been on the hunt for a configuration file format with the following three properties:

  1. You can use a library to parse the configuration. Most configuration formats allow for this, though some (nginx, haproxy, vim) aren't so easy.

  2. You can manipulate the keys and values, using the same library.

  3. When that library writes the file to disk, any comments that were present in the original config file are preserved.

Why bother? First, allowing programs to read/write configuration files allows for automated cleanup/manipulation. Go ships with a first-class parser/AST, and as a consequence there are many programs that can lint/edit/vet your source code. These wouldn't be possible without that ast package and a set of related tools that make parsing and manipulating the source easy.
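
As a small illustration of what that parser gives you, the standard library can round-trip a Go source file with its comments intact in a few lines:

import (
    "go/parser"
    "go/printer"
    "go/token"
    "os"
)

func rewrite(filename string) error {
    fset := token.NewFileSet()
    // parser.ParseComments keeps comments attached to the syntax tree...
    f, err := parser.ParseFile(fset, filename, nil, parser.ParseComments)
    if err != nil {
        return err
    }
    // ...and the printer writes them back out, so a program that edits f
    // doesn't lose them.
    return printer.Fprint(os.Stdout, fset, f)
}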

You can imagine installers that could automatically make a change to your configuration; for example, certbot from the Let's Encrypt project tries to automatically edit your Apache or Nginx configuration. This is an incredibly difficult task, due to the complexity of the configurations that have piled up over the years, and the fact that those configuration files weren't built with automatic editing in mind.

Backwards incompatible changes are never good, but their downsides can be mitigated by effective tools for parsing and updating configuration.

You want comments in your configuration file because configurations tend to accumulate over the years and it can be incredibly unclear where values came from, or why values were set the way they were. At Twilio, the same HAProxy config got copied from service to service to service, even though the defined timeouts led to bad behavior. Comments allow you to provide more information about why a value is set the way it is, and note values where you weren't sure what they should be, but had to pick something besides "infinity" before deploying.

What problems do you run into when you try to implement a comment-preserving configuration parser? A lot of config parsers try to turn the file into a simple data type like a dictionary or an array, which immediately loses a lot of the fidelity that was present in the original file. The second problem there is that dictionaries in most languages do not preserve ordering so you might write out the configuration in a different order than you read it, which messes up git diffs, and the comment order.

You are going to need to implement something that is a lot closer to an abstract syntax tree than a dictionary; at the very least maps of keys and values should be stored as an array of tuples and not a dictionary type.
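
In other words, the in-memory representation needs to look something more like this sketch than like a map[string]string:

// Entry is a single key/value pair, kept in file order, with any comments
// that appeared above it attached.
type Entry struct {
    Key      string
    Value    string
    Comments []string
}

// Config preserves ordering (and therefore diffs) by using a slice, not a map.
type Config struct {
    Entries []Entry
}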

The next problem you run into is that syntax trees are great for preserving the fidelity of source code but tend to be unwieldy when all you want to do is index into an array, or get the value for a key, especially when the type of that value may take any number of values - a number, a string, a date, or an array of the above. The good news is configuration files tend to only need a subset of the syntax/fidelity necessary for a programming language (you don't need/want functions, for example) so you can hopefully get away with defining a simpler set of interfaces for manipulating data.

(Incidentally I realized in the course of researching this that I have written two libraries to do this - one is a library for manipulating your /etc/hosts file, and the other is a library for bumping versions in Go source code. Of course those are simpler problems than the one I am trying to solve here).

So let's look at what's out there.

  • JSON is very popular, but it's a non-starter because there's no support for comments, and JSON does not define an ordering for keys and values in a dictionary; they could get written in a different order than they are read. JSON5 is a variant of JSON that allows for code comments. Unfortunately I haven't seen a JSON5 parser that maintains comments in the representation.

  • YAML is another configuration format used by Ansible, Salt, Travis CI, CircleCI and others. As far as I can tell there is exactly one YAML parser that preserves comments, written in Python.

  • XML is not the most popular format for configuration, but the structure makes it pretty easy to preserve comments. For example, the Go standard library parser contains tools for reading and writing comments. XML seems to have the widest set of libraries that preserve comments - I also found libraries in Python and Java and could probably find more if I looked harder.

  • TOML is a newer format that resembles INI syntax, but with a stricter, formally specified grammar. I haven't found any TOML parsers that preserve comments.

  • INI files are used by Windows programs, and the Python configparser module, among others. I have found one parser, in Perl, that tries to preserve comments.

  • Avro is a data serialization format that's gaining popularity for things like database schema definitions. Unfortunately its schemas are written as JSON, so it's out for the same reasons JSON is out.

  • You can use Go source code for your configuration. Unfortunately the tools for working with Go syntax trees are still pretty forbidding for anything beyond extremely simple tasks, especially if you want to go past the token representation of a file and actually work with, say, a struct or an array.
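
As a quick illustration of the XML point above: Go's encoding/xml exposes comments as tokens, so you can read a document and write it back out with the comments intact. Here's a sketch (not from any particular project) using encoding/xml and io:

func roundTripXML(in io.Reader, out io.Writer) error {
  dec := xml.NewDecoder(in)
  enc := xml.NewEncoder(out)
  for {
    tok, err := dec.Token()
    if err == io.EOF {
      break
    }
    if err != nil {
      return err
    }
    // CopyToken is needed because the decoder reuses its internal buffers.
    if err := enc.EncodeToken(xml.CopyToken(tok)); err != nil {
      return err
    }
  }
  return enc.Flush()
}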

I decided on a configuration file format called HCL, from HashiCorp. It resembles Nginx configuration syntax, but ships with a Go parser and printer. It's still a little rough around the edges to get values out of it, so I wrote a small library for getting and setting keys in a configuration map.

This is difficult - it's much easier to write a parser that just converts to an array or a dictionary than one that preserves the structure of the underlying file. But I think we've only scratched the surface of the benefits, with tools like Go's automatic code rewriter and npm init/npm version patch. Hopefully going forward, new configuration formats will ship with a fidelity-preserving parser from day one.

Liked what you read? I am available for hire.

Real Life Go Benchmarking

I've been following the commits to the Go project for some time now. Occasionally someone will post a commit with benchmarks showing how much the commit improves performance along some axis or another. In this commit, they've increased the performance of division by 7 (a notoriously tricky number to divide by) by about 40% on machines running the ARM instruction set. I'm not 100% sure what the commit does, but it switches around the instructions that get generated when you do long division on ARM in a way that makes things faster.

Anyway, I wanted to learn how to do benchmarks, and practice making something faster. Some motivation came along when Uber released their zap logging library, with a set of benchmarks showing my friend Alan's logging library, log15, as the worst performer. So I thought it would be a good candidate for benchmarking.

Fortunately Alan has already included a set of benchmarks in the test suite. You can run them by cloning the project and then calling the following:

go test -run=^$ -bench=.

You need to pass -run=^$ to exclude all tests in the test suite; otherwise all of the tests will run in addition to all of the benchmarks. We only care about the benchmarks, so -run=^$ filters out every test.

We get some output!

BenchmarkStreamNoCtx-4                   	  300000	      5735 ns/op
BenchmarkDiscard-4                       	 2000000	       856 ns/op
BenchmarkCallerFileHandler-4             	 1000000	      1980 ns/op
BenchmarkCallerFuncHandler-4             	 1000000	      1864 ns/op
BenchmarkLogfmtNoCtx-4                   	  500000	      3866 ns/op
BenchmarkJsonNoCtx-4                     	 1000000	      1816 ns/op
BenchmarkMultiLevelFilter-4              	 2000000	      1015 ns/op
BenchmarkDescendant1-4                   	 2000000	       857 ns/op
BenchmarkDescendant2-4                   	 2000000	       872 ns/op
BenchmarkDescendant4-4                   	 2000000	      1029 ns/op
BenchmarkDescendant8-4                   	 1000000	      1018 ns/op
BenchmarkLog15AddingFields-4             	   50000	     29492 ns/op
BenchmarkLog15WithAccumulatedContext-4   	   50000	     33599 ns/op
BenchmarkLog15WithoutFields-4            	  200000	      9417 ns/op
PASS
ok  	github.com/inconshreveable/log15	30.083s

The name on the left is the benchmark name. The number after the dash (4) is the GOMAXPROCS value - here, the number of CPUs - used for the benchmark. The number in the middle is the number of iterations; to get a good benchmark, you want to run the code as many times as is feasible, then divide the total runtime by the iteration count. Otherwise you run into problems like coordinated omission, or weighting the extreme outliers too much, or failing to weight them at all.
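
For reference, a benchmark is just a function in a _test.go file whose name starts with Benchmark, takes a *testing.B, and loops b.N times; the framework picks b.N for you. A minimal sketch (not one of log15's actual benchmarks):

func BenchmarkSprintf(b *testing.B) {
  for i := 0; i < b.N; i++ {
    _ = fmt.Sprintf("%s=%s", "key", "value")
  }
}

Run it with go test -run=^$ -bench=Sprintf and you'll get a line in the same format as the output above.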

To get a "good" benchmark, you want to try to isolate the running code from anything else running on the machine. For example, say you run the tip benchmarks while a Youtube video is playing, make a change, and then run the benchmarks while nothing is playing. Video players require a lot of CPU/RAM to play videos, and all other things being equal, the benchmark is going to be worse with Youtube running. There are a lot of ways to accidentally bias the results as well, for example you might get bored with the tip benchmarks and browse around, then make a change and observe the console intensely to see the new results. You're biasing the results simply by not switching tabs or screens!

If you have access to a Linux machine with nothing else running on it, that's going to be your best bet for benchmarking. Otherwise, shut down as many other programs as are feasible on your main machine before starting any benchmark tests.

Running a benchmark multiple times can also be a good way to compensate for environmental effects. Russ Cox's benchstat program is very useful for this; it gathers many runs of a benchmark, and tells you whether the results are statistically significant. Run your benchmark with the -count flag to run it multiple times in a row:

go test -count=20 -run=^$ -bench=Log15AddingFields | tee -a master-benchmark

Do the same for your change, writing to a different file (change-benchmark), then run benchstat:

benchstat master-benchmark change-benchmark

You'll get really nice looking output; this is the format the Go core team generally uses to report benchmark results.

$ benchstat benchmarks/tip benchmarks/early-time-exit
name                 old time/op  new time/op  delta
StreamNoCtx-4        3.60µs ± 6%  3.17µs ± 1%  -12.02%  (p=0.001 n=8+6)
Discard-4             837ns ± 1%   804ns ± 3%   -3.94%  (p=0.001 n=7+6)
CallerFileHandler-4  1.94µs ± 2%  1.88µs ± 0%   -3.01%  (p=0.003 n=7+5)
CallerFuncHandler-4  1.72µs ± 3%  1.65µs ± 1%   -3.98%  (p=0.001 n=7+6)
LogfmtNoCtx-4        2.39µs ± 2%  2.04µs ± 1%  -14.69%  (p=0.001 n=8+6)
JsonNoCtx-4          1.68µs ± 1%  1.66µs ± 0%   -1.08%  (p=0.001 n=7+6)
MultiLevelFilter-4    880ns ± 2%   832ns ± 0%   -5.44%  (p=0.001 n=7+6)
Descendant1-4         855ns ± 3%   818ns ± 1%   -4.28%  (p=0.000 n=8+6)
Descendant2-4         872ns ± 3%   830ns ± 2%   -4.87%  (p=0.001 n=7+6)
Descendant4-4         934ns ± 1%   893ns ± 1%   -4.41%  (p=0.001 n=7+6)
Descendant8-4         990ns ± 2%   958ns ± 2%   -3.29%  (p=0.002 n=7+6)

OK! So now we have a framework for measuring whether a change helps or hurts.

How can I make my code faster?

Good question! There's no one answer for this. In general, there are three strategies:

  • Figure out a way to do less work than you did before. Avoid doing an expensive computation where it's not necessary, exit early from functions, &c.

  • Do the same work, but in a faster way: use a different algorithm, or a different function, that's faster.

  • Do the same work, but parallelize it across multiple CPUs, or threads.

Each technique is useful in different places, and it can be hard to predict where you'll be able to extract performance improvements. It is useful to know how expensive various operations are, so you can evaluate the costs of a given operation.

It's also easy to spend a lot of time "optimizing" something only to realize it's not the bottleneck in your program. If you optimize something that takes 5% of the code's execution time, the best overall speedup you can get is 5%, even if you get the code to run instantaneously. Go's test framework has tools for figuring out where your code is spending the majority of its time. To get the best use out of them, focus on profiling the code execution for a single benchmark. Here, I'm profiling the StreamNoCtx benchmark in the log15 library.

$ go test -cpuprofile=cpu.out -benchmem -memprofile=mem.out -run=^$ -bench=StreamNoCtx -v
BenchmarkStreamNoCtx-4   	  500000	      3502 ns/op	     440 B/op	      12 allocs/op

This will generate 3 files: cpu.out, mem.out, and log15.test. These are binary files; you want to use the pprof tool to evaluate them. First let's look at the CPU profile; I've started pprof and then run top5 to get the top 5 functions.

$ go tool pprof log15.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top5
560ms of 1540ms total (36.36%)
Showing top 5 nodes out of 112 (cum >= 60ms)
      flat  flat%   sum%        cum   cum%
     180ms 11.69% 11.69%      400ms 25.97%  runtime.mallocgc
     160ms 10.39% 22.08%      160ms 10.39%  runtime.mach_semaphore_signal
     100ms  6.49% 28.57%      260ms 16.88%  github.com/inconshreveable/log15.escapeString
      60ms  3.90% 32.47%       70ms  4.55%  bytes.(*Buffer).WriteByte
      60ms  3.90% 36.36%       60ms  3.90%  runtime.stringiter2

The top 5 functions are responsible for 36% of the program's runtime. flat is how much time is spent inside a function; cum shows how much time is spent in a function and in any code it calls. Of these 5, runtime.stringiter2, runtime.mallocgc, and runtime.mach_semaphore_signal are not good candidates for optimization: they are very specialized pieces of code, and they're part of the Go runtime, so changes need to pass all tests and get approved by the Go core development team. We could potentially figure out how to call them less often, though - mallocgc indicates we are creating lots of objects, so maybe we can figure out how to create fewer of them.

The likeliest candidate for improvement is the escapeString function in the log15 codebase itself. pprof's list command is super useful for checking this.

(pprof) list escapeString
ROUTINE ======================== github.com/inconshreveable/log15.escapeString in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
      30ms      330ms (flat, cum) 23.40% of Total
      10ms       10ms    225:func escapeString(s string) string {
         .          .    226:	needQuotes := false
         .       90ms    227:	e := bytes.Buffer{}
         .          .    228:	e.WriteByte('"')
      10ms       50ms    229:	for _, r := range s {
         .          .    230:		if r <= ' ' || r == '=' || r == '"' {
         .          .    231:			needQuotes = true
         .          .    232:		}
         .          .    233:
         .          .    234:		switch r {
         .          .    235:		case '\\', '"':
         .          .    236:			e.WriteByte('\\')
         .          .    237:			e.WriteByte(byte(r))
         .          .    238:		case '\n':
         .          .    239:			e.WriteByte('\\')
         .          .    240:			e.WriteByte('n')
         .          .    241:		case '\r':
         .          .    242:			e.WriteByte('\\')
         .          .    243:			e.WriteByte('r')
         .          .    244:		case '\t':
         .          .    245:			e.WriteByte('\\')
         .          .    246:			e.WriteByte('t')
         .          .    247:		default:
         .      100ms    248:			e.WriteRune(r)
         .          .    249:		}
         .          .    250:	}
         .       10ms    251:	e.WriteByte('"')
         .          .    252:	start, stop := 0, e.Len()
         .          .    253:	if !needQuotes {
         .          .    254:		start, stop = 1, stop-1
         .          .    255:	}
      10ms       70ms    256:	return string(e.Bytes()[start:stop])

The basic idea here is to write a string, but add a backslash before double quotes, newlines, and tab characters. Some ideas for improvement:

  • We create a bytes.Buffer{} at the beginning of the function. We could keep a Pool of buffers, and retrieve one, so we don't need to allocate memory for a buffer each time we escape a string.

  • If a string doesn't contain a double quote, newline, tab, or carriage return, it doesn't need to be escaped. Maybe we can avoid creating the Buffer entirely for that case, if we can find a fast way to check whether a string has those characters in it (see the sketch below).

  • If we call WriteByte twice in a row, we could probably replace it with a single WriteString(), which copies both bytes at once instead of going through two separate writes.

  • We call e.Bytes() and then cast the result to a string. Maybe we can figure out how to call e.String() directly.

You can then look at how much time is being spent in each area to get an idea of how much any given idea will help your benchmarks. For example, replacing WriteByte with WriteString probably won't save much time; you're at most saving the time it takes to write every quote and newline, and most strings are made up of letters and numbers and space characters instead. (It doesn't show up at all in the benchmark but that's because the benchmark writes the phrase "test message" over and over again, and that doesn't have any escape-able characters).
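
Here's a rough sketch of the first two ideas - a fast path that skips escaping entirely, plus a sync.Pool of buffers - to show the shape of the change. It is not log15's actual code, and it glosses over some details (for example, the real check would also need to treat other control characters below ' ' as needing quotes):

var bufPool = sync.Pool{
  New: func() interface{} { return new(bytes.Buffer) },
}

func escapeStringPooled(s string) string {
  // Fast path: nothing needs escaping and nothing forces quoting.
  if !strings.ContainsAny(s, " \\\"\n\r\t=") {
    return s
  }
  e := bufPool.Get().(*bytes.Buffer)
  e.Reset()
  defer bufPool.Put(e)
  e.WriteByte('"')
  for _, r := range s {
    switch r {
    case '\\', '"':
      e.WriteByte('\\')
      e.WriteByte(byte(r))
    case '\n':
      e.WriteString("\\n")
    case '\r':
      e.WriteString("\\r")
    case '\t':
      e.WriteString("\\t")
    default:
      e.WriteRune(r)
    }
  }
  e.WriteByte('"')
  return e.String()
}

Plain strings with no spaces or special characters now skip the buffer entirely; the pooled buffer cuts the allocations for everything else. You'd want to confirm both with benchstat before trusting it.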

That's the CPU profile: how much time the CPU spends running each function in the codebase. There's also the memory profile: how much memory each function allocates. Let's look at that! We have to pass one of these flags to pprof to get it to show memory information.

Sample value selection option (for heap profiles):
  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts

Let's pass one. Notice in this case, the top 5 functions allocate 96% of the objects:

$ go tool pprof -alloc_objects log15.test mem.out
(pprof) top5
6612331 of 6896359 total (95.88%)
Showing top 5 nodes out of 18 (cum >= 376843)
      flat  flat%   sum%        cum   cum%
   2146370 31.12% 31.12%    6612331 95.88%  github.com/inconshreveable/log15.LogfmtFormat.func1
   1631482 23.66% 54.78%    1631482 23.66%  github.com/inconshreveable/log15.escapeString
   1540119 22.33% 77.11%    4465961 64.76%  github.com/inconshreveable/log15.logfmt
    917517 13.30% 90.42%    1294360 18.77%  github.com/inconshreveable/log15.formatShared
    376843  5.46% 95.88%     376843  5.46%  time.Time.Format

Let's look at a function:

ROUTINE ======================== github.com/inconshreveable/log15.logfmt in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
   1540119    4465961 (flat, cum) 64.76% of Total
         .          .     97:		if i != 0 {
         .          .     98:			buf.WriteByte(' ')
         .          .     99:		}
         .          .    100:
         .          .    101:		k, ok := ctx[i].(string)
         .    2925842    102:		v := formatLogfmtValue(ctx[i+1])
         .          .    103:		if !ok {
         .          .    104:			k, v = errorKey, formatLogfmtValue(k)
         .          .    105:		}
         .          .    106:
         .          .    107:		// XXX: we should probably check that all of your key bytes aren't invalid
         .          .    108:		if color > 0 {
         .          .    109:			fmt.Fprintf(buf, "\x1b[%dm%s\x1b[0m=%s", color, k, v)
         .          .    110:		} else {
   1540119    1540119    111:			fmt.Fprintf(buf, "%s=%s", k, v)
         .          .    112:		}
         .          .    113:	}

In the common case on line 111 (when color = 0), we're calling fmt.Fprintf to write the key, then an equals sign, then the value. Fprintf also has to allocate memory for its own byte buffer, then parse the format string, then interpolate the two variables. It might be faster, and avoid allocations, to just call buf.WriteString(k), then write the equals sign, then call buf.WriteString(v). But you'd want to benchmark it first to double check!
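
The change being described is small - something like this sketch (not log15's actual code; the helper name is made up):

// writeKV appends "key=value" to buf with direct writes, avoiding Fprintf's
// format-string parsing and its intermediate allocations.
func writeKV(buf *bytes.Buffer, k, v string) {
  buf.WriteString(k)
  buf.WriteByte('=')
  buf.WriteString(v)
}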

Conclusion

Using a combination of these techniques, I was able to improve the performance of log15 by about 30% for some common code paths, and reduce memory allocations as well. I was not expecting this, but I also found a way to speed up JSON encoding in the Go standard library by about 20%!

Want someone to benchmark/improve performance in your company's application, or teach your team more about benchmarking? I am available for consulting; email me and let's set something up!

Liked what you read? I am available for hire.

Cleaning up Parallel Tests in Go 1.7

I have a lot of tests in Go that integrate with Postgres, and test the interactions between Go models and the database.

A lot of these tests can run in parallel. For example, any test that attempts to write a record, but fails with a constraint failure, can run in parallel with all other tests. A test that tries to read a random database ID and expects not to fetch a record can run in parallel with other tests. If you write your tests so they all use random UUIDs, or all run inside of transactions, you can run them in parallel. You can use this technique to keep your test suite pretty fast, even if each individual test takes 20-40 milliseconds.

You can mark a test to run in parallel by calling t.Parallel() at the top of the test. Here's an example test from the job queue Rickover:

func TestCreateMissingFields(t *testing.T) {
  t.Parallel()
  test.SetUp(t)
  job := models.Job{
    Name: "email-signup",
  }
  _, err := jobs.Create(job)
  test.AssertError(t, err, "")
  test.AssertEquals(t, err.Error(), "Invalid delivery_strategy: \"\"")
}

This test will run in parallel with other tests marked Parallel, and only with those tests; all other tests run sequentially.

The problem comes when you want to clear the database. If you have a t.Parallel() test clean up after it has made its assertions, it might try to clear the database while another Parallel() test is still running! That wouldn't be good at all. Presumably, the sequential tests are expecting the database to be cleared. (They could clear it at the start of the test, but this might lead to unnecessary extra DB writes; it's better for tests that alter the database to clean up after themselves).

(You can also run every test in a transaction, and roll it back at the end. Which is great, and gives you automatic isolation! But you have to pass a *sql.Tx around everywhere, and make two additional roundtrips to the database, which you probably also need to do in your application).
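
If you do go the transaction route, the helper can stay small. Here's a sketch, assuming a package-level *sql.DB named db that gets opened in TestMain; the helper name is made up:

// WithTx runs f inside a transaction and always rolls it back, so each test
// leaves the database untouched no matter what it writes.
func WithTx(t *testing.T, f func(tx *sql.Tx)) {
  tx, err := db.Begin()
  if err != nil {
    t.Fatal(err)
  }
  // These are the two extra round trips to the database mentioned above.
  defer tx.Rollback()
  f(tx)
}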

Go 1.7 adds the ability to nest tests. Which means we can run setup once, run every parallel test, then tear down once. Something like this (from the docs):

func TestTeardownParallel(t *testing.T) {
  // This Run will not return until the parallel tests finish.
  t.Run("group", func(t *testing.T) {
    t.Run("Test1", parallelTest1)
    t.Run("Test2", parallelTest2)
    t.Run("Test3", parallelTest3)
  })
  // <tear-down code>
}

Note you have to lowercase the names of the parallel test functions; otherwise go test will pick them up as top-level tests and run them both inside the group and again individually. I settled on this pattern:

var parallelTests = []func(*testing.T){
  testCreate,
  testCreateEmptyPriority,
  testUniqueFailure,
  testGet,
}
func TestAll(t *testing.T) {
  test.SetUp(t)
  defer test.TearDown(t)
  t.Run("Parallel", func(t *testing.T) {
    for _, parallelTest := range parallelTests {
      test.F(t, parallelTest)
    }
  })
}

The test package mentioned there is the set of test helpers from the Let's Encrypt project, plus some of my own. test.F finds the passed function's name, capitalizes it, and passes the result to t.Run:

// capitalize the first letter in the string
func capitalize(s string) string {
  r, size := utf8.DecodeRuneInString(s)
  return fmt.Sprintf("%c", unicode.ToTitle(r)) + s[size:]
}
// F runs f as a subtest, deriving the subtest's name from the function's own name
func F(t *testing.T, f func(*testing.T)) {
  longfuncname := runtime.FuncForPC(reflect.ValueOf(f).Pointer()).Name()
  funcnameparts := strings.Split(longfuncname, ".")
  funcname := funcnameparts[len(funcnameparts)-1]
  t.Run(capitalize(funcname), f)
}

The result is a set of parallel tests that run a cleanup action exactly once. The downside is the resulting tests have two levels of nesting; you have to define a second t.Run that waits for the parallel tests to complete.

=== RUN   TestAll
=== RUN   TestAll/Parallel
=== RUN   TestAll/Parallel/TestCreate
=== RUN   TestAll/Parallel/TestCreateEmptyPriority
=== RUN   TestAll/Parallel/TestUniqueFailure
=== RUN   TestAll/Parallel/TestGet
--- PASS: TestAll (0.03s)
    --- PASS: TestAll/Parallel (0.00s)
        --- PASS: TestAll/Parallel/TestCreate (0.01s)
        --- PASS: TestAll/Parallel/TestCreateEmptyPriority (0.01s)
        --- PASS: TestAll/Parallel/TestUniqueFailure (0.01s)
        --- PASS: TestAll/Parallel/TestGet (0.02s)

The other thing that might trip you up: if you add print statements to your teardown code, they'll appear in the console output before the PASS lines. However, I verified they do run after all of your parallel tests have finished.

Liked what you read? I am available for hire.

Buying stocks without a time limit

Recently I put the maximum amount of cash into my IRA account. Since stock prices jump up and down all the time, I wondered whether the current price would be the best one to buy the stock at. In particular, I'm not withdrawing money from my IRA for the better part of 40 years, so I'm not going to sell it — I want the absolute lowest price possible.

Stock markets offer a type of order that's useful for this: a limit order, which says essentially "I'll buy N shares at price X", where X is some price below the last traded price of the stock. Anyone can take the other side of that order at any time, which usually happens when the stock price drops to or below the limit price. So you can put in a limit order at any time and it will execute whenever the stock price drops below that amount. I'm simplifying slightly but that's the gist.

So if the current stock price is X, what price should I put on my limit order? You should be able to come up with some kind of model of the stock price and how it moves and say, roughly, "If you set your limit order at Z, there will be a ~98% chance of it executing in the next year." The usual criticism of stock price models is that they underestimate the probability of extreme events; that's alright for us, since extreme moves only make it more likely that our limit order will trigger.

I figured someone studied this problem before so I asked about it on Quant Stack Exchange and I got a really good answer! The answer pointed to two different papers, which have attempted to model this.

Essentially the model looks at the volatility of the stock - how often it jumps up and down - the stock's current price, the probability you want the order to hit, and the amount of time you want to wait. The formula is a little complex but it looks like this:
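
The papers' exact formula isn't reproduced here, but it has the shape of the standard first-passage probability for geometric Brownian motion: for a current price S_0, a limit price L below it, annualized volatility \sigma, log-drift \nu, and a time horizon T, the chance the price touches L at some point before T is roughly

\[
P\Big(\min_{0 \le t \le T} S_t \le L\Big) = \Phi\!\left(\frac{\ln(L/S_0) - \nu T}{\sigma\sqrt{T}}\right) + \left(\frac{L}{S_0}\right)^{2\nu/\sigma^2} \Phi\!\left(\frac{\ln(L/S_0) + \nu T}{\sigma\sqrt{T}}\right)
\]

where \Phi is the standard normal CDF. Set the left-hand side to the probability you want (0.95, say) and solve for L numerically.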

I wrote some code to compute these values for the stocks I wanted to buy. The hardest part is figuring out the volatility, because you need access to a lot of historical prices, and you compute the standard deviation of the daily returns from them. I was able to get historical price data from Quandl, and current stock prices from Yahoo.
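
The volatility calculation itself is short: take the daily log returns of the closing prices, compute their sample standard deviation, and annualize by multiplying by the square root of the number of trading days in a year (about 252). Here's a sketch, not the code from the repository:

// annualizedVolatility returns the daily and annualized volatility computed
// from a slice of daily closing prices (oldest first).
func annualizedVolatility(closes []float64) (daily, annual float64) {
  returns := make([]float64, 0, len(closes)-1)
  for i := 1; i < len(closes); i++ {
    returns = append(returns, math.Log(closes[i]/closes[i-1]))
  }
  var mean float64
  for _, r := range returns {
    mean += r
  }
  mean /= float64(len(returns))
  var variance float64
  for _, r := range returns {
    variance += (r - mean) * (r - mean)
  }
  variance /= float64(len(returns) - 1)
  daily = math.Sqrt(variance)
  annual = daily * math.Sqrt(252)
  return daily, annual
}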

Combine the two and you get output that looks like this:

$ go run erf.go main.go --stock=<stock> --total=5500 --percent 0.95
the standard deviation is: 0.006742
annualized: 0.107022
current price: 74.95
Based on this stock's volatility, you should set a limit order for: $74.35.

Compared with buying it at the current price, you'll be able to buy 0.6 extra shares (a value of $43.77)

Pretty neat! I've put the code up on GitHub, so you can take a look for yourself.

Liked what you read? I am available for hire.