The two-man rule in engineering

In nuclear weapons design, there is a two-man rule that prevents any single individual from accidentally — or maliciously — launching nuclear weapons. Each step requires knowledge and consent from two individuals to proceed. Even when the President initiates a launch order, he must jointly authenticate with the Secretary of Defense (they’re given separate codes, even though the President has sole authority).

When the order reaches the launch control center, two people are required to authenticate and initiate the launch, for example by (vastly simplifying…) turning two keys simultaneously.

The benefits are at least twofold. First, it’s much harder to compromise or impersonate two people simultaneously than it is to compromise one. Second, it also provides error correction. When two people are involved in a process, it’s much more likely that if someone is about to make an oversight or error, it will be caught. This works better when the roles are asymmetric, because then they won’t both be on the same “wavelength.” Most good processes of this type seem to be asymmetric in some way.

There are many contexts where we want error correction and extra security: executing large financial transfers, preparing patients for surgery, performing space shuttle launch checks, or running nuclear reactors. It also comes up a lot in software development, which is what got me thinking about this. Let’s count the ways we implement the two-man rule:

Code review: Everyone is either doing this or making bad excuses for why they shouldn’t. But it’s the clearest and most accessible example of a two-man rule in software engineering.

Spec review: An essential part of any sizable project is a review of the specification to make sure, in particular, that 1) the right thing is being built in the right way, and 2) the right people and teams are aware of any impact the work might have on them.

Continuous integration: The branch built on your machine, but does it build on another one? This turns up countless “oh right I added this config variable/package and forgot to propagate the change” incidents before they become blocking.

Pair programming: I think of this as just real-time code review. It has all the same benefits and more, with the downside that it can’t be done asynchronously.

Deployments: I wish we did this closer to 100% of the time, but it has definitely been helpful to have a second person on hand for deployments in addition to the primary engineer. This is especially critical during complex deployments that happen in phases or involve many moving parts. Ideally the role is relegated to going through the checklist one last time (“says there are database migrations, are we expecting downtime or can we keep pre-boot on, and if so is the config correct?”), and in the event of an issue, helping to investigate or doing the checklist in reverse to roll back.

Mind the Gap

As we continue to grow, there are a few areas where I think a more consistent two-man rule will lead to high return on effort in the future:

  • manually rebooting servers, changing server counts or container types
  • adding/scaling services
  • running one-off commands against the production database
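
For the last item, here is a minimal sketch of what a two-man rule around one-off production commands could look like. It is only an illustration: the class name, the method names, and the idea of passing in an `execute` callable are all assumptions made for this example, not a description of any tooling we actually run.

```python
class TwoPersonGate:
    """Hypothetical gate: a command runs only after a second person approves it."""

    def __init__(self, command, requested_by):
        self.command = command
        self.requested_by = requested_by
        self.approved_by = None

    def approve(self, reviewer):
        # The reviewer must be a different person than the requester.
        if reviewer == self.requested_by:
            raise PermissionError("requester cannot approve their own command")
        self.approved_by = reviewer

    def run(self, execute):
        # Refuse to execute anything that has not been reviewed.
        if self.approved_by is None:
            raise PermissionError("command has not been approved by a second person")
        return execute(self.command)
```

The roles are asymmetric on purpose: one person writes the command, the other only reads and approves it, which is the property that makes the error correction work.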

And yes, every once in a blue moon we deploy tiny changes to production without full code review, or force a failing build onto staging — something that is intentionally difficult and unwieldy to do. This has gone from rare to extremely rare, and I expect this trend to continue. But I like processes to be developed and enforced bottom-up if possible, and prefer values over inflexible rules. So far this tenet hasn’t failed us, and we still trust each other with good judgment above all else.

However, as the stakes get higher every day, the cost/benefit equation will eventually tip towards a standard operating procedure that can be summarized as “trust, but verify.” If that doesn’t sound like a good proverb to live by, maybe a second opinion is in order?

 

Don't tweak all the variables at once

I have been at Privy for a year. I’m proud of the team and product we’ve built, and I was excited to sit down and make a list of some of the new things I learned during my time here. Then I realized that most of these “lessons” would’ve been covered if I had just re-read everything ever written by Fred Brooks, Martin Fowler and Eric Ries…but that doesn’t make a good blog post.

So that got me thinking about the things I already sorta-knew that had been validated. Perhaps there was some pattern there. And so I made my first-order list, which I present below.

I have learned virtually nothing about…

  1. Using a stack in the middle of the adoption curve: Ruby on Rails.
    • Ruby/MRI is between 2 and 50x slower than running a static language on the JVM, but even a slight increase in developer productivity more than makes up for the operations cost.
    • The advantage of using a really fancy stack (more cool factor for recruiting, etc) really doesn’t seem to compare favorably to the disadvantages (more uncertainty, smaller pool of technical talent).
    • The evidence that startups regularly die due to technology stack is vanishingly flimsy, so no need to dwell here.
  2. Building a local team.
    • Geographically distributed teams and getting on the bandwagon of “work anywhere cuz we have Slack lol” seem all the rage today, but the early team is more important than the early product, and the best teams are in the same place every day.
    • Resisting the urge to go remote has been something of a useful filtering mechanism: does this individual believe enough in our vision to consider moving here for the job?[1]
  3. Having some really solid cultural values (or aspirations, as they may be) that aren’t totally groundbreaking.
    • It’s more important that we live up to great values than come up with amazing ones. I’ll leave the latter to the management consultants.
  4. Using traditional engineering management.
    • We basically do agile: there are weekly-ish sprints; we do higher level planning on a monthly basis; a couple times a year we work on a strategic roadmap. We write software specifications before we code, and we ship daily with continuous integration and lower test coverage than I’d like to admit. Yawn.
    • We don’t use “flat” organizations or Holacracy or whatever trendy hipster management structure is in vogue. What the hell kind of problem is this trying to solve anyway? My theory is it’s got something to do with cool factor for recruiting, but I have a feeling the people trying this are no more certain than I am.

What’s the big meta lesson here?

If anything, it probably goes a little bit like this: the available levers to pull in a startup are numerous, but there are only a few that make a measurable difference. The things that are most likely to kill us are the things that kill most startups: having a subpar team, building a product that nobody wants, executing poorly on feedback loops, that kind of thing.

These are the things that, in Paul Graham terminology, make you “default dead” until you figure out how to get them right. And it’s critically important to realize that things like “what do we build?” and “who do we sell it to?” are the things that startups are doing “wrong by default” and need to diagnose and fix as quickly as possible.

But then there are the other things, like “how do we write a scalable system to respond to HTTP requests?” or “how should we manage engineering teams?” in which there are essentially no forced errors, and where (barring a well-articulated exception[2]) the correct answer is the default one. So almost all of the risks here seem to be to the downside, and any upside is probably insignificant compared to the scale and difficulty of the hard problem: building a novel product under uncertainty.

There are certainly going to be exceptions to this. There are going to be teams that have figured out how to deviate from orthodoxy and are reaping benefits from it. I’m OK with this, and my theory is that it either doesn’t matter (e.g., they were going to be a success anyway) or it won’t rescue them (they’re doomed and they didn’t differentiate in a way that mattered).

And so it must follow that the majority of our iterating and tweaking is on the thing that will make us a great company: what do we build? Who do we sell it to? There are enough variables in there that I don’t really have any brainpower left over to do anything except reach for a Generic Ruby/Python/JavaScript framework and use engineering/recruiting/management techniques that were old 30 years ago.

 

[1] This isn’t all roses, since it biases us significantly towards younger folks who don’t have as many attachments, the net effect of which is…debatable, but obviously not lethal in a vibrant tech city like Boston.

[2] Example: One excuse I’ve used to provision real hardware in a real datacenter as opposed to just spinning up an EC2 instance is “I’ve done the math and TCO in AWS is literally 25X more expensive.”

How to uninstall the default Windows 10 apps and disable web search

If you’re like me, you’ve been enjoying Windows 10 for quite some time now. Couple things annoy me:

1. I accidentally changed all my file associations to the new default Windows apps, because the (intentionally) misleading firstrun experience presented fine print I glossed over.
2. I don’t like searching the web from the Windows Start menu, because I’d rather not transmit everything I type there over the network. Call me old fashioned.

Remove default apps

Open up a PowerShell prompt and run this to remove most of the default apps:

Get-AppxPackage *onenote* | Remove-AppxPackage
Get-AppxPackage *zunevideo* | Remove-AppxPackage
Get-AppxPackage *bingsports* | Remove-AppxPackage
Get-AppxPackage *windowsalarms* | Remove-AppxPackage
Get-AppxPackage *windowscommunicationsapps* | Remove-AppxPackage
Get-AppxPackage *windowscamera* | Remove-AppxPackage
Get-AppxPackage *skypeapp* | Remove-AppxPackage
Get-AppxPackage *getstarted* | Remove-AppxPackage
Get-AppxPackage *zunemusic* | Remove-AppxPackage
Get-AppxPackage *windowsmaps* | Remove-AppxPackage
Get-AppxPackage *soundrecorder* | Remove-AppxPackage

Turn off Web Search

Next, open up Group Policy Editor (gpedit.msc) and navigate to:

Computer Configuration -> Administrative Templates -> Windows Components -> Search. Enable the policies:

  • Do not allow web search
  • Don’t search the web or display web results in Search
  • Don’t search the web or display web results in Search over metered connections

Finally, open up “Cortana and Search Settings” and disable “Search online and enable web results”.

Heroku Pricing Changes

Couple of quick points on Heroku’s pricing changes which I’ve been meaning to get out:

  • It’s not an across-the-board price cut. While the dyno pricing has decreased, they also got rid of the $36ish/month in free dyno credits.
  • New free tier replaces the free dyno credit. Minimum 6 hours of sleep per day means no more abusing the free tier by pinging your app every few minutes to keep it from sleeping. Seems a lot of people were doing this to run production apps for free; good riddance.
  • New $7/month hobby tier is a great new option for people who were previously hosting production apps for free and need them live 24/7. This is a great deal since you can even have worker/background dynos for the same price. Makes sense for Heroku too – they’ll derive a good deal of long tail revenue from folks who would’ve previously just stuck with the free tier (maybe using the ping hack to prevent idling). Honestly I think the revenue is not the point – it’s more just preventing people from abusing the free tier while giving enough folks a no-excuses carrot to use the platform so it’ll be a no-brainer when they “go pro.”
  • Professional dyno pricing drop is great, but it’s going to be a wash for the majority of paying users because the free credit is going away. Basically there’s no longer a big cliff where you go from free->paid, and the steepness of the pricing increases is somewhat lower. My intuition is that the winners are the 4-5 figure/month customers, which makes sense since that’s around the time they start thinking about moving to AWS directly for cost savings. More of them will just consider staying.
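
To make the "wash for small customers, win for big ones" intuition concrete, here is a rough break-even sketch. The $36/month credit comes from the point above; the per-dyno prices (`old_price=36.0`, `new_price=25.0`) are illustrative assumptions for the example, not figures quoted in this post, so plug in the real numbers before drawing conclusions.

```python
def old_monthly_cost(dynos, old_price=36.0, free_credit=36.0):
    """Old model: pay per dyno, minus the monthly free dyno credit (never below zero)."""
    return max(0.0, dynos * old_price - free_credit)


def new_monthly_cost(dynos, new_price=25.0):
    """New model: cheaper dynos (assumed price), but no free credit."""
    return dynos * new_price


def first_cheaper_dyno_count(max_dynos=100):
    """Smallest dyno count at which the new pricing is strictly cheaper, if any."""
    for dynos in range(1, max_dynos + 1):
        if new_monthly_cost(dynos) < old_monthly_cost(dynos):
            return dynos
    return None
```

Under these assumed prices a one-dyno app goes from free to paid, while the new model only starts winning at around four dynos, and the savings keep growing from there.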

Why Work at a Startup?

Because I’m tired of explaining to everyone, I’m going to make this list to refer to anyone who asks. While I don’t think any of these are particularly original, it makes a handy checklist for anyone considering a similar jump[1].

  • Faster time to market. At Privy, we routinely ship code that was written earlier in the day or week. Seems petty, but as an engineer, it’s frustrating to improve something and then not have it in the hands of customers for weeks or months.
  • More hats to wear. The diversity of work at a startup appeals to me. I can work on product, recruiting, and engineering. Before lunch. The pace and scope of work is both faster and longer term, and I like being involved in multiple parts of the business.
  • Be judged by customers, not managers. A startup makes each person less insulated from the market. Therefore the correlation between performance and rewards tends to be much closer.
  • Less politics. As a consequence of the last point, politics becomes less important. It’s much harder to bullshit accomplishments in a startup when the entire company fits into a small room or two. Tired of carrying teammates who aren’t pulling their own weight? Join a startup.
  • Incredible learning. As another corollary to being closer to market forces, I’ve learned a lot about how to run a business that provides value to customers in exchange for money. I’ve in turn been able to apply experience I’ve gained elsewhere that I never would’ve been able to use at a larger company, because my job title would’ve prevented me from doing anything other than engineering.
  • Challenging the status quo, not defending it. Name recognition is cool, but I never got the sense that my role at Office was about reshaping how people work – probably because our market share had nowhere to go but down. But I’ve found I don’t mind playing the underdog as long as I have a thesis about how the future should change for the better.

 

1. In a necessary but not sufficient way (i.e. if these don’t apply to you, a startup is probably a bad idea; but if they do apply to you, a startup could still be a bad idea).

Don't get a Masters in Computer Science

I am pretty sure most software engineers should get a BS in computer science. I’ve written extensively about this. But I’m often asked by prospective engineers whether it’s worth the effort to get the MS too. In the past I’ve mostly dodged on this, with a hedged answer I would charitably paraphrase as “umm, probably no, but maybe yes, if you find a subfield you really like.”

Today I realized that this is terrible advice. If you have to ask, you should not get a master’s degree in computer science.

Why? Because all you MS CS candidates suck at the most basic interviews.

Seriously.

Like I sometimes have trouble differentiating between people with an MS and people who have literally never coded in their lives. But maybe that’s because they aren’t mutually exclusive:

  • I don’t do this anymore, but I used to just ask fizzbuzz over the phone, and the candidates who routinely failed this were either masters students or masters grads looking for their first job.
  • For some MS CS grads, reversing a string is literally a half-hour affair, and doing it in-place without an O(n) memory allocation is considered “tricky.”
  • I once had a poor soul with a masters degree spend 10 minutes failing to name a way to communicate between 2 computers.
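
For reference, this is roughly the level of code the first two questions call for. It is a sketch in Python rather than whatever language a candidate would pick. Python strings are immutable, so the in-place reversal works on a list of characters instead.

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")   # multiple of both 3 and 5
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


def reverse_in_place(chars):
    """Reverse a mutable sequence of characters using O(1) extra memory."""
    left, right = 0, len(chars) - 1
    while left < right:
        # Swap the outermost pair, then walk both indices toward the middle.
        chars[left], chars[right] = chars[right], chars[left]
        left += 1
        right -= 1
```

The two-index swap is the whole trick behind the "no O(n) allocation" version: nothing is copied, and each element is touched once.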

I don’t know what’s going on here.

But I have a few theories:

1) Software engineering experience compounds, but instruction in CS fundamentals offers diminishing returns after 4 years. I might be suffering from some Dunning-Kruger here as I only have a BS, but the vast majority of fundamental, broadly applicable theory seems to taper out after ~3 years of quality instruction, in my experience.

2) MS programs lack even remotely standardized curriculum or admissions requirements. Master’s programs seem to fall into two camps: the “we’re vetting you for a PhD” camp, and the “professional degree” camp (which is very likely a cash cow for the university). Both camps assume you have prior exposure to the subject matter, and therefore won’t have a well-structured curriculum in fundamentals. But if an MS CS program doesn’t teach CS fundamentals (that’s what the BS is for, right?), and doesn’t require a BS CS for admission, how does that ensure graduates have a baseline level of knowledge upon graduation? It doesn’t.

3) MS students have low or no exposure to actual coding. A lot of MS degree work I’ve seen either involved studying esoteric algorithms or mathematical proofs, or research that mostly involved bragging about how the machine running a neural network has 256GB of RAM. I took a few graduate level courses back in my day, and I’d venture at least half of them required no coding whatsoever. Now recall the part about no structured curriculum, and you are well on your way to a choose-your-own-adventure degree that could easily see you to graduation day writing about as much code as a real engineer might deploy to production before lunch today.

Of course, it goes without saying this isn’t all candidates from all schools. But it is a pattern, and these days I just reflexively de-prioritize talking to MSCS candidates because to do otherwise is a setup for disappointment.

The truth is, I suspect this state of affairs is a mix of correlation and causation. I know it’s wrong, but “if this candidate was any good, he would’ve gotten a job on the strength of his skills rather than making his resume fancier while waiting out the recession or whatever” has crept into the back of my mind before.

It’s simple. We, uh, kill the batman.

It doesn’t really have to be this way. If your goal is to be the best engineer that you can, those 2ish years of extra experience you get in the industry make a big difference. Those are your learning years where you absorb hard-won experience from your seniors on engineering trade-offs and how to work on teams with existing codebases under real multidimensional constraints.

And if your goal is to make the most money you can, an MS almost never pays off unless you just happened to specialize in something that is both rare and highly in demand. Otherwise, if you are lucky, you are looking at, compared to a fresh BS CS grad, a pay bump of ~$10k. Maybe. Forget comparing to someone who graduated with the BS CS one or two years ago; they’ve left you in the dust.

This should be obvious, if you think about it for a moment. New grad engineers increase their skills and value tremendously over 2 years; they get commensurate increases in salary to reflect this[1], and the average person who took those 2 years to get an MS CS is starting from an experience deficit and never catches up. It’s no wonder then that it only offers a ~$5-10k salary bump: it isn’t all that valuable on its own.

So don’t get a master’s degree[2]. It probably won’t pay off, and your engineering career will suffer. There are exceptions, but they don’t apply to Joe Shmoe with an MS from Nowheresville.

[1] Mostly by changing jobs, because employers in this industry seem to routinely under-level new grad engineers as they gain experience, but that’s another rant for another time.

[2] But if you do, get a BS CS first. I see again and again that most successful people with master’s degrees started with the BS.

How we sped up our background processing 150x

Performance has always been an obsession of mine. I enjoy the challenge of understanding why things take as long as they do. In the process, I often discover that there’s a way to make things faster by removing bottlenecks. Today I will go over some changes we recently made to Privy that resulted in our production application sending emails 150x faster per node!

Understanding the problem

When we started exploring performance in our email queueing system, all our nodes were near their maximum memory limit. It was clear that we were running as many workers as we could per machine, but the CPU utilization was extremely low, even when all workers were busy.

Anyone with experience will immediately recognize that this means these systems were almost certainly I/O bound. There’s a couple obvious ways to fix this. One is to perform I/O asynchronously. Since these were already supposed to be asynchronous workers, this didn’t seem intuitively like the right answer.

The other option is to run more workers. But how do you run more workers on a machine already running as many workers as can fit in memory?

Adding more workers

We added more workers per node by moving from Resque to Sidekiq. For those who don’t know, Resque is a process-based background queuing system. Sidekiq, on the other hand, is thread-based. This is important, because Resque’s design means a copy of the application code is duplicated across every one of its worker processes. If we wanted two Resque workers, we would use double the memory of a single worker (because of the copy-on-write nature of forked process memory in Linux, this isn’t strictly true, but it was quite close in our production systems due to the memory access patterns of our application and the Ruby runtime).

Making this switch to Sidekiq allowed us to immediately increase the number of workers per node by a factor of roughly 6x. All the Sidekiq workers are able to more tightly share operating system resources like memory, network connections, and database access handles.
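
The underlying idea, that threads inside one process can overlap their waiting when jobs are I/O bound, can be sketched outside of Ruby. The snippet below is an analogy in Python, not Sidekiq code: `send_email` is a made-up stand-in that just sleeps to simulate network I/O, and the worker counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_email(job_id, io_delay=0.05):
    """Hypothetical job: the sleep stands in for time spent waiting on the network."""
    time.sleep(io_delay)
    return job_id


def run_jobs(job_ids, workers):
    """Run all jobs on a pool of threads sharing one process; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send_email, job_ids))
    return results, time.perf_counter() - start
```

Calling `run_jobs(range(12), workers=1)` and then `run_jobs(range(12), workers=6)` should show the second run finishing several times faster, because while one thread waits on I/O the others make progress, all within a single copy of the application in memory.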

How did we do?

This one change resulted in a performance change of nearly 30x (as in, 3000% as fast).

Wait, what?

Plot twist!

How did running more workers also result in a performance increase of roughly 5x (500% as fast) per worker? I had to do some digging. As it turns out, there are a number of things that make Resque workers slower:

  • Each worker process forks a child process before starting each job. This takes time, even on a copy-on-write system like Linux.
  • Then, since there are now two processes sharing the same connection to Redis, the child has to reopen the connection.
  • Now, the parent will have to wait on the child process to exit before it can check the queue for the next job to do.

When we compounded all of these across every worker, it turns out these were, on average, adding a multiple-seconds-long penalty to every job. There is almost certainly something wrong here (and no, it wasn’t paging). I’m sure this could’ve been tuned and improved, but I didn’t explore since it was moot at this point anyway.

Let’s do better – with Computer Science™

In the course of rewriting this system, we noticed some operations were just taking longer than felt right. One of these was the scheduling system: we schedule reminder emails to be sent out in Redis itself, inserting jobs into a set that is sorted by time. Sometimes things happen that require removing scheduled emails (for example, if the user performs the action we were trying to nudge them to do).

While profiling the performance of these email reminders, I noticed an odd design: whenever the state of a claimed offer changes (including an email being sent), all related scheduled emails are removed and re-inserted (based on what makes sense for this new state). Obviously, this is a good way to make sure that anything unnecessary is removed without having to know what those things are. I had a hunch: If the scheduled jobs are sorted by time, how long would it take to find jobs that aren’t keyed on time?

O(n). Whoops!

It turns out that the time it took to send an email depended linearly on how many emails were waiting to be sent. This is not a recipe for high scalability.

We did some work to never remove scheduled jobs out of order. Instead, scheduled jobs check their validity at runtime and no-op if there is nothing to do. Since no operations depend linearly on the size of the queue any more, it’s a much more scalable design.
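
Here is a small sketch of the two approaches, using an in-memory heap as a stand-in for the time-sorted set. All names (`LazyScheduler`, `offer_state`, `cancel_eagerly`) and the tuple layout are hypothetical and chosen for illustration. The real system lives in Redis and Sidekiq and will differ in detail.

```python
import heapq


def cancel_eagerly(scheduled, offer_id):
    """Old approach: the queue is ordered by time, not by offer, so finding an
    offer's jobs means scanning every scheduled job: O(n) per state change."""
    return [job for job in scheduled if job[2] != offer_id]


class LazyScheduler:
    """New approach: never remove out of order; jobs validate themselves when due."""

    def __init__(self):
        self._heap = []          # entries: (run_at, seq, offer_id, expected_state)
        self._seq = 0            # tie-breaker so equal run_at values stay ordered
        self.offer_state = {}    # offer_id -> current state

    def schedule(self, run_at, offer_id, expected_state):
        heapq.heappush(self._heap, (run_at, self._seq, offer_id, expected_state))
        self._seq += 1

    def run_due(self, now):
        """Pop every job that is due and return the offers that still need an email."""
        sent = []
        while self._heap and self._heap[0][0] <= now:
            _, _, offer_id, expected_state = heapq.heappop(self._heap)
            if self.offer_state.get(offer_id) != expected_state:
                continue         # state changed since scheduling: no-op
            sent.append(offer_id)
        return sent
```

The trade-off is that stale jobs stay in the queue until their time comes, but each one costs only a cheap state check when it is popped, rather than a full scan on every state change.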

By making this change, we saw an increase in performance of more than 5x in production.

Summing up

  • Moving from process-based to thread-based workers: ~6x more workers per node.
  • Moving from forking workers to non-forking workers: 5x faster.
  • Removing O(n) operations from the actual email send job: 5x faster.
  • Total speedup: Roughly 150x performance improvement.
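
The total is just the product of the independent factors above, which a few lines of arithmetic can confirm. The dictionary keys are labels made up for this example.

```python
from math import prod

# Approximate per-node factors taken from the summary above.
factors = {
    "more workers per node (threads instead of processes)": 6,
    "faster workers (no fork per job)": 5,
    "no O(n) work in the send job": 5,
}


def combined_speedup(selected):
    """Independent multiplicative speedups compound by multiplication."""
    return prod(selected)


sidekiq_switch = combined_speedup(list(factors.values())[:2])   # 6 * 5
total = combined_speedup(factors.values())                      # 6 * 5 * 5
```

The first two factors give the roughly 30x from the Sidekiq switch alone, and the scheduling fix multiplies that by another 5.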

Compounding Advantages

The biggest myth about successful people is the “overnight success.” There’s basically no such thing. This is a great platitude, which happens to be true, but how can we deconstruct it down to its quintessential lesson?

The first point of order is to understand where advantages that lead to success come from. They might come from raw talent – which I won’t focus on, because it isn’t something you can control for (and experience is often confused with raw talent, because they look the same to outsiders). Or they might come from external sources – such as growing up with good financial security, in a two-parent household, in a well-off neighborhood with good schools. Those types of advantages are mostly out of your control as well, so that’s out too. Finally, there is experience.

Experience is the advantage most under your control. When most people ask me for advice about careers in computer science, they often know they are at a disadvantage (often because they are switching career tracks), but aren’t sure of the most efficient way to erase that deficit. But what appears to be an insurmountable disadvantage is usually the result of years of hard work, or a lack thereof.

So how does one gain experience without any experience? Isn’t that like some sort of catch-22?

Not really. If it were, then by definition the industry couldn’t possibly exist, now could it?

(Normally, when people claim that it’s a catch-22, they’re just being unrealistic about what types of jobs are actually entry-level, or, more likely, they aren’t willing to do what it takes to become qualified for entry level jobs. In fact, software engineering is one of the easiest jobs to gain experience in, because all you need is a keyboard and monitor that eventually connects to the internet, and some free time. So whining about it is just immature.)

This isn’t really an essay on how to get into software engineering, since I’ve already written a bit on that topic. But there is a recurring theme, which is that it takes consistent application of conscious effort to build and maintain the credentials to become an engineer. And most importantly, all experience advantages start small, and compound over time. So the best way to become the best engineer is to start coding, a lot. Today.

Why coding?

Because while software engineering is about much, much more than just coding, coding is the most important part. It’s the only part you can’t skip. It’s also one of the easiest skills to show off and test for.

OK. So what should you code?

There’s no one-size-fits-all answer, but here’s a few starting points:

1) Go to Codecademy and start one of the courses. It almost doesn't matter which one, since they're all pretty solid.
Pros: Structured learning with helpful hints and explanations, sense of progression.
Cons: Toy problems that don't require reading existing code (an extremely useful skill) as much as the other options do.
2) Take a Coursera course (core concepts with programming involved -- data structures, algorithms, operating systems).
Pros: Online-classroom environment, instructor-led with a focus on fundamentals.
Cons: Academic in nature, which is actually sort of a plus, but it won't maximize lines of code per day.
3) Download a release of Ruby on Rails and start a web app.
Pros: Good documentation and explicit best-practices, more "realistic" than some guided courses.
Cons: Undirected learning. Requires product management to design things to code, which is a distraction. Too much Ruby/Rails "magic" abstracts away important concepts.
4) Browse Github (etc) and find an open source project to contribute to.
Pros: Working on released software, chance to interact with other coders. Most "realistic" experience.
Cons: Reading code is significantly harder than writing code.
5) Download the iOS / Android SDK and create a mobile app.
Pros: Everyone loves mobile.
Cons: Learning programming, a programming language, how to read documentation, and a complex API at the same time can be extremely overwhelming.

So…About that degree thing

I’m of the opinion that most software engineers should get a Bachelor’s in Computer Science. I’ve hammered on this point before. There are exceptions though. Like, do you know your computer science fundamentals (data structures, algorithms, operating systems, programming paradigms, software lifecycles)? Do you have practical software engineering experience (e.g., measured in years), doing work that shipped?

If not, I still recommend a CS degree, because it’s an excellent signaling mechanism, and you can complete one full-time in less than the traditional 4 years. However, coding boot camps have been all the rage lately, and I wanted to touch on them briefly.

Basically, coding boot camps are an excellent option for many people (and I know of many who have successfully gone this route), but I don’t recommend them in general because the best engineers aren’t minted in 12 weeks. It’s a different story if you already have some experience under your belt, but don’t want to get a full-on BSCS. But in that case, a coding boot camp generally isn’t really tailored for you anyway, since most programs don’t require existing experience by design. And that means you lose the benefits of a compounding advantage by not building on existing experience.

This is the main advantage of following a degree-granting program. It starts with the fundamentals, and then builds on that foundation with programming experience and core theory, leveraging your existing knowledge.

Boom.

You gain a small advantage, compounding itself.

Why I picked Microsoft over Amazon

It’s interviewing season, and that means people are going to get offers really soon. I’ve been wanting to write a blurgh post about my decision to pick Microsoft over Amazon for some time now, and I’ve been asked for my reasoning a couple times. So maybe I can help others make the right choice.

I may be rationalizing my decision in hindsight, but it turns out there were a number of advantages Microsoft has over Amazon; here is the view from 10,000 feet:

  1. Substantially better benefits (health, wellness, employee stock purchase plan, 401k matching, perks), and slightly better overall compensation. You can increase your cash income an additional 5-8% risk-free by taking full advantage of ESPP, 401k, and other game-y things with your health benefits.
  2. Generous relocation package + annual performance bonus. It makes up for not getting a hiring bonus at least.
  3. Stock vesting is substantially faster (for Amazon, stock vesting is all backloaded so the last 80% or so vests in years 3 and 4). At MSFT the vesting is a linear 25% every year.
  4. I get my own office, and work-life balance is generally better. Microsoft’s median employee tenure backs this up.
  5. No on-call rotations[1]. Annual performance bonuses in cash, in addition to stock. Did I mention work-life balance?
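
To see how much the vesting schedule in point 3 matters, here is a quick sketch of cumulative vesting by year. The percentages are the ones given in the comparison table (25% per year versus 5%, 15%, then 20% every six months), with Amazon's last two years collapsed into yearly totals for simplicity.

```python
from itertools import accumulate

# Percent of the hiring stock grant vesting in each of years 1-4.
MICROSOFT_VESTING = [25, 25, 25, 25]
AMAZON_VESTING = [5, 15, 40, 40]   # 20% every 6 months in years 3 and 4


def cumulative_vesting(schedule):
    """Return the percent of the grant vested by the end of each year."""
    return list(accumulate(schedule))
```

After two years this gives 50% vested on the linear schedule versus 20% on the backloaded one, which is the gap that matters if you leave before year three.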

And a handy table I put together, mostly from a combination of sources (stars denote uncertainty) and my highly scientific opinion:

| | Microsoft | Amazon | Favors |
|---|---|---|---|
| Onboarding | | | |
| Relocation (from east coast) | all-expenses-paid or $5000 cash, tax-assisted (2011) | all-expenses-paid or $7500 cash, tax-assisted | Amazon |
| Signing bonus | None in 2011, there may be a small one now | ~25% base in 2 installments, pro-rated for 2 years | Strongly Amazon |
| Hiring Stock Grant | ~60% of base, vesting: 25% per year | ~50% base, vesting: 5% 1st yr, 15% 2nd yr, then 20% every 6 months | Microsoft |
| Compensation | | | |
| Base salary | 60-75th percentile (on average, industry norm +15%) | 50-75th percentile (on average, industry norm +10%) | Leaning Microsoft |
| Base salary increase | 0-9%, 3.5-4% is typical | on average less than 3.5% | Microsoft |
| Annual cash bonus | On average 10% of base | usually none | Strongly Microsoft |
| Annual stock grants | < 10% of base | Between 10-15% of base* | Amazon |
| Career | | | |
| Promotions | see trajectory discussion | see trajectory discussion | Amazon |
| Benefits | | | |
| 401k matching | 50% of contributions up to 6% of base salary (3% match) | 50% of contributions up to 4% of base salary (2% match) | Microsoft |
| Employee Stock Purchase Plan | 10% discount, purchases capped at 15% of base salary | none | Strongly Microsoft |
| Other fringe benefits | Prime Card, free onsite health screenings, various health incentives & rewards, charity+volunteering match, discounted group legal plan for routine legal work | 10% off up to $1000 in Amazon.com purchases per year | Strongly Microsoft |
| Health | see health benefits discussion | see health benefits discussion | Leaning Microsoft |
| Kitchen | Soft drinks, milk, juice, tea, on-demand Starbucks, espresso | Tea, powdered cider, drip coffee | Leaning Microsoft |
| Time off | 3 weeks vacation, 10 paid holidays, 2 personal days | 2 weeks vacation (3wks after 1st year), 6 paid holidays, 6 personal days | Microsoft |
| Culture | | | |
| Location | Redmond | Seattle | Strongly Amazon |
| Tools/Platforms | Closed source Microsoft stack, proprietary. Many legacy desktop platforms, lots of new services | Open source Linux stack. Almost entirely services-based, many legacy concerns. Best-in-class deployment tools. | Strongly Amazon |
| On-call | Expected of most engineers (unless product has no services component, increasingly unlikely) | Expected of most engineers | Leaning Microsoft |
| Median Age | 33 | 32 | |
| Median Tenure | 4.0 years | 1.0 years | |
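
To make the vesting difference concrete, here is a small Python sketch that turns the hiring-grant rows of the table into cumulative vested stock per year, expressed as a percent of base salary. It assumes a flat stock price and ignores annual refresh grants; the names are illustrative only.

```python
from itertools import accumulate

# Percent of the hiring grant vesting in each of years 1-4, per the table.
MICROSOFT_VESTING = [25, 25, 25, 25]
AMAZON_VESTING = [5, 15, 40, 40]  # 20% every 6 months in years 3 and 4


def cumulative_vested(schedule, grant_pct_of_base):
    """Cumulative vested stock at the end of each year, as a percent of base salary."""
    return [pct * grant_pct_of_base / 100 for pct in accumulate(schedule)]


msft = cumulative_vested(MICROSOFT_VESTING, grant_pct_of_base=60)
amzn = cumulative_vested(AMAZON_VESTING, grant_pct_of_base=50)
for year, (m, a) in enumerate(zip(msft, amzn), start=1):
    print(f"end of year {year}: Microsoft {m:.1f}% of base, Amazon {a:.1f}% of base")
```

After one year this gives 15% of base vested at Microsoft versus 2.5% at Amazon, which matters a lot if you leave around the median tenure.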

Career Trajectory

The great thing about Microsoft is that there’s always a career path for people who want to become valued individual contributors. However, you should be aware that the difficulty level ramps up pretty quickly. Generally, most ICs are unlikely to earn the title of Senior SDE in less than 4-5 years, and Microsoft will rarely consider someone for a lead engineer role (the first rung in the management ladder[2]) who has fewer than 6-7 years under his belt. However, the promotions don’t stop just because you don’t want to be a manager – excellent ICs can earn titles like Principal Engineer, Distinguished Engineer, and Technical Fellow, and the people holding them are respected and valued as much as Corporate Vice Presidents.

At Amazon, expect a lot of responsibility to ramp up fairly quickly, along with somewhat higher chances for advancement — both because Amazon is growing faster, and because it has higher rates of attrition (I suspect attrition is higher at the bottom than the top; but I have no evidence for this). Three years out of college is not atypical for being offered SDM I (first rung on management track). This is partly because of the horrible retention; by the time you hit 3 years, you’re more tenured than about 80% of the company. Anecdotally, I have heard talented Microsoft ICs on the management track note to me that specific Amazon counterparts are progressing faster (to development manager) than themselves. So if management track progression is your goal — pick Amazon, not Microsoft.

Health Benefits

Microsoft in 2014

  • 100% preventative care covered, always.
  • HSP ($1000-$2500 annual employer HSA contribution, $1500-$3750 deductible; $1000-$2500 coinsurance) or HMO (no deductible/limited coinsurance, copays of $20-$100 for outpatient service)
  • full or partial dental coverage + payroll credit
  • vision: free annual eye exam and up to $225 of vision hardware per year; lasik benefit
  • free gym membership OR up to $800 in cash reimbursement for fitness purchases OR $200 cash
  • free life insurance – 2x annual base pay
  • long term disability insurance – 60% of monthly income up to $15,000
  • optional accidental death & dismemberment

Amazon in 2014
Documents I got my hands on weren’t heavy on details. I’m just going to go out on a limb here and say Microsoft’s health benefits are better. Here is a copypasta from their careers page that tells you approximately nothing about how they compare to Microsoft:

  • A choice of four medical plans, including prescription drug coverage, designed to meet your individual needs, with domestic partner coverage
  • Dental plan
  • Vision plan
  • Company-paid basic life and accident coverage as well as optional coverage at a low cost
  • Company-paid short- and long-term disability plan
  • Employee assistance program including dependent-care referral services and financial/legal services
  • Health-care and dependent-care flexible spending accounts

Fringe Benefits

Allow me a moment to blow you away with the absurd benefits Microsoft offers. Prime Card gives you random discounts on everything from Apple products to local restaurants. It also gets you discounted admission (I think $5?) to IMAX movies. Microsoft hosts free onsite health screenings for general health, flu shots, glucose/cholesterol testing, etc — and even gives away gift cards for attending. They have a charity matching program – they’ll match dollar for dollar every contribution you give to registered charities, and also pay $18/hr to any charity you volunteer at to increase your impact. There’s a discounted group legal plan that costs, I think, something like 30-40 dollars a month for routine legal work. There’s tuition reimbursement. There’s generous paid maternity (AND paternity) leave.

Amazon discounts 10% (up to $100 off) of annual Amazon.com purchases, which is cool too, I guess.

Commuting, Culture & Tools

Microsoft runs free shuttles to most major residential areas nearby – the largest private bus system in the world, in fact. On top of that, they provide a free ORCA card for unlimited free travel on the local bus system.

Amazon also has a free ORCA card on offer, but only a limited private shuttle system between campuses.

Amazon is in Seattle proper, where anything vaguely resembling nightlife happens; Microsoft is on the so-called Eastside across a narrow bridge where basically nothing does. This is not an insignificant issue for many people who work at Microsoft but want to live in Seattle – this is likely to extend your commute by at least 30 minutes each way.

As far as tools go, both companies have first-rate toolchains. Amazon probably leads here, as they have a very impressive toolset, dependency management system, and deployment process. On the other hand, Microsoft’s approach to the software engineering process is both much more disciplined and less flexible. They produce some of the finest program managers. And almost all their tools are closed-source, so you’re unlikely to be using, say, git, unless you work at Amazon. The downside of Amazon’s agility is a sometimes chaotic software development process; getting stuck on a team with a mandate to improve a service while simultaneously fixing bad architecture and rush-job warts is not uncommon, and it is unrewarding.

Work-life balance is manageable at both companies. I’ve had a number of 60-hour weeks, maybe even a few 70-hour weeks near shipping time. They were out of the norm. I’m inclined to say Microsoft requires fewer hours on average than Amazon, where people might see 45-50 / week as closer to normal. Everyone will tell you “how much work you get done” matters more than “how many hours you put in.” This is a half-truth: you need to put in the right amount of face time, so don’t be at either tail of the bell curve.

TL;DR:
If you want to work in a fast-paced environment leading the way in services, cross your fingers every time you deploy, and don’t mind getting paged in the middle of the night, work for Amazon. If you want to make slightly more money shipping desktop software (or deploy services like you would ship desktop software), and pretend with your 100,000 coworkers that the company is becoming “agile,” work for Microsoft.

Thanks to all the fine folks who answered my questions and reviewed early drafts of this.

Notes

[1] This is no longer true on many teams at Microsoft. For the most part I hear it’s not as bad as at Amazon, but there can be what I charitably call “rough patches” when a team implements on-call rotations for the first time, and invariably screws things up until the alert frequency can be tuned correctly (ASK ME HOW I KNOW).
[2] Microsoft recently moved away from formal lead positions as of 4Q 2014, bringing it into alignment with most other companies like Google and Amazon. Basically all ICs report to a dev manager now, and a “lead” engineer has no direct reports any more, but has de facto authority over a project or team. This doesn’t change the fact that progressing from an IC to a manager at Microsoft is both very hard and takes a long time.

The State of Securing HTTP in 2014

In a post-Snowden world, it’s not unreasonable to ask that every site consider deploying “all HTTPS, all the time,” even if just to troll the guys who really want to track what videos I’m watching on YouTube. Here’s how the process breaks down, based on my research and experience. This is not a guide, but a general overview of the state of the art.

Performance was a major reservation I had going into this. So I’m happy to say: SSL/TLS is not only inexpensive, it’s ridiculously cheap. In my case, CPU load increase was on the order of 1%, which is a rounding error. Memory usage might have gone up by a couple KB per connection, and there was no noticeable increase in network overhead. In an end-to-end test, a browser with a cold cache actually loaded and rendered secure pages in (statistically speaking) the same amount of time. On the server side, it’s impossible to tell visually from the perf graphs when SSL was enabled. This is on a site serving more than 6 million hits a month.

There are a couple of things needed to achieve this transparent performance level while still achieving high security, and it is almost entirely configuration-dependent with very little having to do with the application itself.

Session resumption

SSL and TLS support an abbreviated handshake in lieu of a full one when the client has previous session information cached. The full handshake takes two roundtrips (plus one from the TCP handshake), but session resumption can save the server from doing an RSA operation for the client key exchange, as well as a roundtrip. But the big win is the reduction of a roundtrip.
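
The roundtrip arithmetic above can be sketched in a few lines of Python. This is a deliberately crude model that only counts network roundtrips before the first HTTP request can be sent; it ignores CPU time, and the constant and function names are mine.

```python
TCP_HANDSHAKE_RTTS = 1
FULL_TLS_RTTS = 2      # full handshake, as described above
RESUMED_TLS_RTTS = 1   # the abbreviated handshake saves one roundtrip


def time_to_first_request_ms(rtt_ms: float, resumed: bool) -> float:
    """Network latency spent on handshakes before the first HTTP request."""
    tls_rtts = RESUMED_TLS_RTTS if resumed else FULL_TLS_RTTS
    return (TCP_HANDSHAKE_RTTS + tls_rtts) * rtt_ms


rtt = 80.0  # hypothetical roundtrip time in milliseconds
print("full handshake:   ", time_to_first_request_ms(rtt, resumed=False), "ms")
print("resumed handshake:", time_to_first_request_ms(rtt, resumed=True), "ms")
```

On a high-latency mobile link, saving one of three roundtrips is a much bigger win than the RSA operation the server skips.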

Here is where things start to get weird. There are at least two forms of session resumption. The session identifier method is baked into SSL and is therefore supported by default – enable this on the server and it should Just Work. The one issue with this is that the server must maintain a session cache. Worse, if you have multiple nodes in the backend you need to either move this session cache into a shared pool or implement some kind of session affinity (urgh). Or perhaps you could do something even stupider and have SSL terminate at your load balancer, defeating the purpose of having a load balancer.

Luckily, TLS has an optional extension (described in RFC 5077) that makes the client store the session resumption data, including the master secret that was negotiated and the cipher. This session ticket is further encrypted and readable only by the server to prevent tampering. In short, implementing TLS session tickets as a resumption protocol gives you the best of both worlds – a faster, abbreviated handshake without requiring the server to cache anything. Unfortunately this is an optional extension to TLS, so it is not supported by all clients and servers, and a pool of servers needs to be configured properly so they can share tickets by using a common encryption key (unless you have session affinity, in which case it doesn’t matter).
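
One way to check whether resumption actually works against a given server is to connect twice from the same client and offer the first connection's session on the second. Below is a minimal sketch using Python's standard `ssl` module; the helper names are mine, the host is whatever you pass in, and the context is capped at TLS 1.2 as a simplifying assumption so that the session is available as soon as the handshake completes. Note that the client cannot easily tell from this whether the server used session identifiers or session tickets; it only sees whether the abbreviated handshake happened.

```python
import socket
import ssl


def make_context() -> ssl.SSLContext:
    """Client context with normal certificate checks, capped at TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def connect_once(ctx, host, port=443, session=None):
    """Do one handshake; return the negotiated session and whether it was resumed."""
    with socket.create_connection((host, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=host, session=session) as tls:
            return tls.session, tls.session_reused


def check_resumption(host, port=443):
    """Return (first_reused, second_reused); expect (False, True) if resumption works."""
    ctx = make_context()  # a session can only be reused with the context that created it
    session, first_reused = connect_once(ctx, host, port)
    _, second_reused = connect_once(ctx, host, port, session=session)
    return first_reused, second_reused
```

Calling `check_resumption("your.site.example")` against each node behind a load balancer (rather than just once) is a quick way to find out whether the pool really shares its ticket key or session cache.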

Certificate Chains

Most certificates require an intermediate certificate to be presented along with them. There could even be several intermediates. Ensure the following:

1) Send all required certificates to validate the chain. If you don’t, things will probably still work because browsers will tolerate anything short of genocide, but they’ll probably do a DNS/TCP/HTTP dance in the middle of your TLS handshake to some other server to grab the certificate, which is obviously no good.

2) Don’t send any extraneous certificates. Besides the obvious slowdown from sending unnecessary data, you *might* in some cases cause an extra roundtrip if you overflow your TCP congestion window and end up having to wait for an ACK before sending more data. So yeah, fewer packets matter here.

OCSP Stapling

OCSP is the Online Certificate Status Protocol. When a client receives the server’s certificate, it normally connects to the certificate authority to ask if the certificate has been revoked. You can save the client a roundtrip to the CA server if you enable the OCSP stapling extension, in which the server periodically connects to the CA to perform the OCSP check itself, then staples this response to the client during the handshake.

This works because the CA’s response is both time-stamped and signed. Clients can be assured that the OCSP response has therefore not been tampered with, nor can it be used in a primitive replay attack since it has an expiration tied to its validity.
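
The freshness half of that argument can be sketched as follows. OCSP responses carry `thisUpdate` and `nextUpdate` timestamps; the `StapledResponse` class below is a hypothetical, already-parsed stand-in for one, and the signature check (the other half of the argument) is deliberately left out. The five-minute clock-skew allowance is my own assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StapledResponse:
    """Hypothetical parsed OCSP response; signature verification is omitted."""
    cert_status: str       # "good", "revoked", or "unknown"
    this_update: datetime  # when the CA produced this status
    next_update: datetime  # when the CA promises newer information


def staple_is_acceptable(resp, now, max_clock_skew=timedelta(minutes=5)):
    """Accept only a 'good' status that is inside its validity window."""
    if resp.cert_status != "good":
        return False
    not_too_early = resp.this_update - max_clock_skew <= now
    not_expired = now <= resp.next_update + max_clock_skew
    return not_too_early and not_expired


issued = datetime(2014, 6, 1, tzinfo=timezone.utc)
resp = StapledResponse("good", issued, issued + timedelta(days=4))
print(staple_is_acceptable(resp, issued + timedelta(days=1)))   # inside the window
print(staple_is_acceptable(resp, issued + timedelta(days=30)))  # replayed much later
```

A recorded response stops being useful to an attacker once `next_update` has passed, which is why the server has to keep refreshing its stapled copy.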

There are some aggravating issues here. One is that OCSP stapling only allows one response to be stapled at a time, which is problematic for certificate chains and will probably end with the client making its own OCSP calls anyway. Another is that OCSP responses can be relatively heavy (like 1KB-ish), which combined with the certificates themselves can overflow the TCP congestion window and cause a roundtrip for ACKs, as mentioned above.
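
To get a feel for when the certificate chain plus a stapled OCSP response costs an extra roundtrip, here is a back-of-the-envelope Python sketch. It models idealized TCP slow start (the sender's window starts at some number of segments and doubles every roundtrip, with no loss), which is a simplification; the certificate sizes are hypothetical, and the initial windows of 3 and 10 segments are meant to represent older and newer TCP stacks respectively.

```python
def flights_needed(payload_bytes: int, initcwnd_segments: int, mss: int = 1460) -> int:
    """Number of roundtrip-separated flights needed to deliver payload_bytes.

    Simplified slow start: the window starts at initcwnd_segments * mss
    and doubles after every flight; packet loss is ignored.
    """
    window = initcwnd_segments * mss
    sent = 0
    flights = 0
    while sent < payload_bytes:
        sent += window
        window *= 2
        flights += 1
    return flights


# Hypothetical sizes: leaf cert + two intermediates + one stapled OCSP response.
handshake_bytes = 1500 + 1500 + 1500 + 1000
for initcwnd in (3, 10):
    print(f"initcwnd={initcwnd}: {flights_needed(handshake_bytes, initcwnd)} flight(s)")
```

With these numbers the smaller initial window needs a second flight while the larger one does not, which is the extra roundtrip the text is warning about.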

Cipher Suites

This part can get filled with conjecture and hypotheticals pretty quickly, but basically you have two choices to make:

1) Cipher: A stream cipher like RC4 is fast and doesn’t require padding; this saves bytes on the wire. A block cipher like AES (in CBC mode) will need padding bytes, but is the safer choice, since RC4 has known statistical biases in its keystream. AES-256 is overkill for a 1024-bit public key, but that’s not really a concern since most keys are 2048-bit now.

2) Key exchange: RSA or DHE+RSA? With ephemeral Diffie-Hellman support you can enable Perfect Forward Secrecy, which prevents previously recorded traffic from being decrypted even if the private key is later compromised. This is really powerful and really secure, but it also breaks debugging tools like Wireshark since having the private key doesn’t help you decrypt the traffic. You’ll also handshake at about half the speed of pure RSA.
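
As an illustration of what "only forward-secret key exchange" looks like in configuration, here is a minimal sketch using Python's `ssl` module, which passes an OpenSSL cipher string straight through. The particular cipher string is my choice, not a recommendation from this post: it also allows elliptic-curve ephemeral Diffie-Hellman (ECDHE), which is not discussed above, and restricts the cipher to AES-GCM. A real server would additionally need `load_cert_chain()` with its certificate and key.

```python
import ssl


def forward_secret_context() -> ssl.SSLContext:
    """Server-side context that drops plain-RSA key exchange for TLS 1.2 and below."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # OpenSSL cipher string: authenticated ephemeral ECDH or DH, AES-GCM only.
    ctx.set_ciphers("ECDHE+AESGCM:DHE+AESGCM")
    return ctx


ctx = forward_secret_context()
for suite in ctx.get_ciphers():
    # TLS 1.3 suites (names starting with "TLS_") may also be listed;
    # they always use an ephemeral key exchange.
    print(suite["name"])
```

Printing the resulting list is a cheap way to confirm that no RC4 or plain-RSA suites slipped through before you deploy the same string to your web server.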

The reality is that it’s more likely you’ll get pwned by some buffer overflow or Heartbleed-type bug than have someone factor your key, so keep this in mind when selecting your key exchange algorithm and cipher.

Other Considerations

You may need HSTS (HTTP Strict Transport Security) if you have concerns about SSLStrip vulnerabilities. SSLStrip is a man-in-the-middle attack in which the MITM silently redirects requests to non-HTTPS pages and then simply copies the transmitted plaintext data. You can implement “HSTS” at the application level selectively by detecting and forwarding to the HTTPS version of your page. But if you are doing this nonselectively, why write more code to do something slower higher up in the stack? A lack of protection at both the application and protocol level will result in fun things like this though: https://codebutler.com/firesheep/
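
If you do end up handling this in the application, the two pieces are a redirect for plain-HTTP requests and a `Strict-Transport-Security` header on HTTPS responses (browsers ignore the header when it arrives over plain HTTP). Here is a minimal WSGI middleware sketch in Python; the function names, the one-year `max-age`, and the toy application are assumptions for illustration, and a web server or load balancer can usually do the same thing with a line or two of configuration.

```python
def hsts_middleware(app, max_age=31536000):
    """Wrap a WSGI app: redirect HTTP to HTTPS and add the HSTS header on HTTPS."""
    def wrapped(environ, start_response):
        if environ.get("wsgi.url_scheme") != "https":
            host = environ.get("HTTP_HOST", "localhost")
            path = environ.get("PATH_INFO", "/")
            query = environ.get("QUERY_STRING", "")
            location = f"https://{host}{path}" + (f"?{query}" if query else "")
            start_response("301 Moved Permanently",
                           [("Location", location), ("Content-Length", "0")])
            return [b""]

        def add_hsts(status, headers, exc_info=None):
            # Append the header to whatever the wrapped app already set.
            headers = list(headers) + [
                ("Strict-Transport-Security", f"max-age={max_age}")
            ]
            return start_response(status, headers, exc_info)

        return app(environ, add_hsts)
    return wrapped


def hello_app(environ, start_response):
    """Toy application used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = hsts_middleware(hello_app)
```

The redirect alone does not stop SSLStrip on a user's very first visit; the header is what tells the browser to refuse plain HTTP for that host from then on.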

Application level changes may be required. Some code is more prone to breaking or causing issues when served over HTTPS. Poorly written JavaScript for example can break if it isn’t expecting HTTPS, and external services that don’t support HTTPS can be problematic. Specifically, all images/resources need protocol-agnostic links AND need to be served securely as well, otherwise you’ll get an annoying “insecure page” warning in the browser since not every resource was sent securely.
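
A quick way to find the offending links is to scan your rendered pages for subresources that hard-code `http://`. The following is a minimal sketch using Python's standard `html.parser`; the set of tags it looks at and the helper names are my own choices, and it will not catch resources loaded from CSS or built up in JavaScript.

```python
from html.parser import HTMLParser

# Tag -> attribute that makes the browser fetch a subresource.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "link": "href",
}


class InsecureResourceFinder(HTMLParser):
    """Collects subresource URLs that explicitly use plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value and value.lower().startswith("http://"):
                self.insecure.append(value)


def find_insecure_resources(html: str) -> list:
    finder = InsecureResourceFinder()
    finder.feed(html)
    return finder.insecure


page = (
    '<img src="http://cdn.example.com/a.png">'
    '<script src="//cdn.example.com/b.js"></script>'
    '<link href="https://example.com/c.css" rel="stylesheet">'
    '<a href="http://example.com/">plain links are fine</a>'
)
print(find_insecure_resources(page))
```

Only the image is flagged here: the protocol-relative script and the HTTPS stylesheet are fine, and ordinary `<a>` links do not trigger the browser warning.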

Older browsers like IE 6/7/8 screw everything up. IE 6 ships with TLS 1.0 disabled by default, and on Windows XP none of them go beyond TLS 1.0 or support SNI; you can either support IE 6 / Windows XP or your connection can be secure.

There’s some moderate, but not insurmountable, IT overhead to enabling SSL/TLS. Apache and Nginx support SSL/TLS either out of the box or via easily enabled modules, so it isn’t hard to set them up and configure them to use OpenSSL. But configuring the options as explained above is paramount. So is keeping OpenSSL and other components up to date.

You also need to remember to renew and swap out expiring certs, as well as securely store and back up private keys, cert signing requests, passwords and certificate authority information. This is because in (one|three|five) years, it’ll be time to remove the expired certificate and deploy a new one, so you’d better A) remember where you got your certificate; B) remember how to log in and buy, request, and obtain a new one; and C) test, deploy, and restart your services using the new certificate.
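
Rather than relying on memory, you can have a cron job or monitoring check nag you. Here is a minimal sketch using Python's standard `ssl` module that reads the `notAfter` field from the certificate a server presents and computes the days remaining; the function names are mine, and `fetch_not_after` needs network access. Note that with default verification, a certificate that has already expired makes the handshake fail before you can read it, so treat a connection error as an alert too.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float = None) -> float:
    """Days left before a certificate's notAfter time, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read notAfter from the certificate a live server presents."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(host: str, warn_days: float = 30) -> bool:
    """True when the certificate expires within warn_days."""
    return days_until_expiry(fetch_not_after(host)) < warn_days
```

Running `needs_renewal("your.site.example")` daily and alerting on `True` gives you a month of slack to work through steps A to C.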

TL;DR:

You should enable SSL. It’s not that hard, performance is a non-issue, and you can buy a cert for about $15 a year. Put a date on your calendar to renew it.