Why I picked Microsoft over Amazon

It’s interviewing season, and that means people are going to get offers really soon. I’ve been wanting to write a blog post about my decision to pick Microsoft over Amazon for some time now, and I’ve been asked for my reasoning a couple of times. So maybe I can help others make the right choice.

I may be rationalizing my decision in hindsight, but it turns out there were a number of advantages Microsoft has over Amazon; here is the view from 10,000 feet:

  1. Substantially better benefits (health, wellness, employee stock purchase plan, 401k matching, perks), and slightly better overall compensation. You can increase your cash income an additional 5-8% risk-free by taking full advantage of ESPP, 401k, and other game-y things with your health benefits.
  2. Generous relocation package + annual performance bonus. At least it makes up for not getting a signing bonus.
  3. Stock vesting is substantially faster (for Amazon, stock vesting is all backloaded so the last 80% or so vests in years 3 and 4). At MSFT the vesting is a linear 25% every year.
  4. I get my own office, and work-life balance is generally better. Microsoft’s median employee tenure backs this up.
  5. No on-call rotations[1]. Annual performance bonuses in cash, in addition to stock. Did I mention work-life balance?
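For the curious, the risk-free gains in item 1 roughly pencil out. Here is a back-of-the-envelope sketch; the base salary is purely illustrative, and the ESPP and 401k terms are the ones from the table below:

```python
# Back-of-the-envelope: risk-free extra income from ESPP + 401k match.
# Assumptions (illustrative): $100k base; ESPP allows buying 15% of base
# at a 10% discount; employer matches 50% of 401k contributions up to
# 6% of base salary.
base = 100_000

# ESPP: pay 90 cents per dollar of stock and sell immediately.
espp_gain = 0.15 * base * (0.10 / 0.90)   # ~1.67% of base

# 401k: contribute at least 6% of base to capture the full 50% match.
match_gain = 0.06 * base * 0.50           # 3% of base

total_pct = (espp_gain + match_gain) / base * 100
print(f"~{total_pct:.1f}% of base, risk-free")   # ~4.7% of base, risk-free
```

The remaining point or two of the 5-8% comes from the game-y health incentives.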

And a handy table I put together, mostly from a combination of sources (stars denote uncertainty) and my highly scientific opinion:

| | Microsoft | Amazon | Favors |
|---|---|---|---|
| Relocation (from east coast) | all-expenses-paid or $5000 cash, tax-assisted (2011) | all-expenses-paid or $7500 cash, tax-assisted | Amazon |
| Signing bonus | None in 2011; there may be a small one now | ~25% of base in 2 installments, pro-rated over 2 years | Strongly Amazon |
| Hiring stock grant | ~60% of base; vesting: 25% per year | ~50% of base; vesting: 5% 1st yr, 15% 2nd yr, then 20% every 6 months | Microsoft |
| Base salary | 60-75th percentile (on average, industry norm +15%) | 50-75th percentile (on average, industry norm +10%) | Leaning Microsoft |
| Base salary increase | 0-9%; 3.5-4% is typical | on average less than 3.5% | Microsoft |
| Annual cash bonus | on average 10% of base | usually none | Strongly Microsoft |
| Annual stock grants | < 10% of base | 10-15% of base* | Amazon |
| Promotions | see trajectory discussion | see trajectory discussion | Amazon |
| 401k matching | 50% of contributions up to 6% of base salary (3% match) | 50% of contributions up to 4% of base salary (2% match) | Microsoft |
| Employee Stock Purchase Plan | 10% discount; purchases capped at 15% of base salary | none | Strongly Microsoft |
| Other fringe benefits | Prime Card, free onsite health screenings, various health incentives & rewards, charity + volunteering match, discounted group legal plan for routine legal work | 10% off up to $1000 in Amazon.com purchases per year | Strongly Microsoft |
| Health | see health benefits discussion | see health benefits discussion | Leaning Microsoft |
| Kitchen | soft drinks, milk, juice, tea, on-demand Starbucks, espresso | tea, powdered cider, drip coffee | Leaning Microsoft |
| Time off | 3 weeks vacation, 10 paid holidays, 2 personal days | 2 weeks vacation (3 weeks after 1st year), 6 paid holidays, 6 personal days | Microsoft |
| Location | Redmond | Seattle | Strongly Amazon |
| Tools/Platforms | closed-source Microsoft stack, proprietary; many legacy desktop platforms, lots of new services | open-source Linux stack; almost entirely services-based, many legacy concerns; best-in-class deployment tools | Strongly Amazon |
| On-call | expected of most engineers (unless product has no services component, increasingly unlikely) | expected of most engineers | Leaning Microsoft |
| Median age | 33 | 32 | |
| Median tenure | 4.0 years | 1.0 years | |

Career Trajectory

The great thing about Microsoft is that there’s always a career path for people who want to become valued individual contributors. However, you should be aware that the difficulty level ramps up pretty quickly. Generally, most ICs are unlikely to earn the title of Senior SDE in less than 4-5 years, and Microsoft will rarely consider someone for lead engineer (the first rung on the management ladder[2]) with fewer than 6-7 years under their belt. However, the promotions don’t stop just because you don’t want to be a manager – excellent ICs can earn titles like Principal Engineer, Distinguished Engineer, and Technical Fellow, which are respected and valued as much as a Corporate Vice President.

At Amazon, expect a lot of responsibility to ramp up fairly quickly, along with somewhat higher chances for advancement — both because Amazon is growing faster, and because it has higher rates of attrition (I suspect attrition is higher at the bottom than the top, but I have no evidence for this). Three years out of college is not atypical for being offered SDM I (the first rung on the management track). This is partly because of the horrible retention: by the time you hit 3 years, you’re more tenured than about 80% of the company. Anecdotally, talented Microsoft ICs on the management track have told me that specific Amazon counterparts are progressing to development manager faster than they are. So if management-track progression is your goal — pick Amazon, not Microsoft.

Health Benefits

Microsoft in 2014

  • 100% preventative care covered, always.
  • HSP ($1000-$2500 annual employer HSA contribution, $1500-$3750 deductible; $1000-$2500 coinsurance) or HMO (no deductible/limited coinsurance, copays of $20-$100 for outpatient service)
  • full or partial dental coverage + payroll credit
  • vision: free annual eye exam and up to $225 of vision hardware per year; lasik benefit
  • free gym membership OR up to $800 in cash reimbursement for fitness purchases OR $200 cash
  • free life insurance – 2x annual base pay
  • long term disability insurance – 60% of monthly income up to $15,000
  • optional accidental death & dismemberment

Amazon in 2014

Documents I got my hands on weren’t heavy on details. I’m just going to go out on a limb here and say Microsoft’s health benefits are better. Here is a copypasta from their careers page that tells you approximately nothing about how they compare to Microsoft:

  • A choice of four medical plans, including prescription drug coverage, designed to meet your individual needs, with domestic partner coverage
  • Dental plan
  • Vision plan
  • Company-paid basic life and accident coverage as well as optional coverage at a low cost
  • Company-paid short- and long-term disability plan
  • Employee assistance program including dependent-care referral services and financial/legal services
  • Health-care and dependent-care flexible spending accounts

Fringe Benefits

Allow me a moment to blow you away with the absurd benefits Microsoft offers. Prime Card gives you random discounts on everything from Apple products to local restaurants. It also gets you discounted admission (I think $5?) to IMAX movies. Microsoft hosts free onsite health screenings for general health, flu shots, glucose/cholesterol testing, etc — and even gives away gift cards for attending. They have a charity matching program – they’ll match dollar for dollar every contribution you give to registered charities, and also pay $18/hr to any charity you volunteer at to increase your impact. There’s a discounted group legal plan that costs, I think, something like 30-40 dollars a month for routine legal work. There’s tuition reimbursement. There’s generous paid maternity (AND paternity) leave.

Amazon discounts 10% (up to $100 off) of annual Amazon.com purchases, which is cool too, I guess.

Commuting, Culture & Tools

Microsoft runs free shuttles to most major residential areas nearby – by some accounts the largest private bus system in the country. On top of that, they provide a free ORCA card for unlimited travel on the local bus system.

Amazon also has a free ORCA card on offer, but only a limited private shuttle system between campuses.

Amazon is in Seattle proper, where anything vaguely resembling nightlife happens; Microsoft is on the so-called Eastside across a narrow bridge where basically nothing does. This is not an insignificant issue for many people who work at Microsoft but want to live in Seattle – this is likely to extend your commute by at least 30 minutes each way.

As far as tools go, both companies have first-rate toolchains. Amazon probably leads here, as they have a very impressive toolset, dependency management system, and deployment process. On the other hand, Microsoft’s approach to the software engineering process is both much more disciplined and less flexible. They produce some of the finest program managers. And almost all their tools are closed-source, so you’re unlikely to be using, say, git, unless you work at Amazon. The downside of Amazon’s agility is a sometimes chaotic software development process; getting stuck on a team with a mandate to improve a service while simultaneously fixing bad-architecture/rush-job warts is not uncommon, and it is unrewarding.

Work-life balance is manageable at both companies. I’ve had a number of 60-hour weeks, maybe even a few 70-hour weeks near shipping time. They were out of the norm. I’m inclined to say Microsoft requires fewer hours on average than Amazon, where people might see 45-50 hours a week as closer to normal. Everyone will tell you “how much work you get done” matters more than “how many hours you put in.” This is a half-truth – you need to put in the right amount of face time; don’t land on either tail of the bell curve.

If you want to work in a fast-paced environment leading the way in services, cross your fingers every time you deploy, and don’t mind getting paged in the middle of the night, work for Amazon. If you want to make slightly more money shipping desktop software (or deploy services like you would ship desktop software), and pretend with your 100,000 coworkers that the company is becoming “agile,” work for Microsoft.

Thanks to all the fine folks who answered my questions and reviewed early drafts of this.


[1] This is no longer true on many teams at Microsoft. For the most part I hear it’s not as bad as at Amazon, but there can be what I charitably call “rough patches” when a team implements on-call rotations for the first time and invariably screws things up until the alert frequency can be tuned correctly (ASK ME HOW I KNOW).
[2] Microsoft recently moved away from formal lead positions as of 4Q 2014, bringing it into alignment with most other companies like Google and Amazon. Basically all ICs report to a dev manager now, and a “lead” engineer has no direct reports any more, but has de facto authority over a project or team. This doesn’t change the fact that progressing from IC to manager at Microsoft is both very hard and very slow.

The State of Securing HTTP in 2014

In a post-Snowden world, it’s not unreasonable to ask that every site consider deploying “all HTTPS, all the time,” even if just to troll the guys who really want to track what videos I’m watching on YouTube. Here’s how the process breaks down, based on my research and experience. This is not a guide, but a general overview of the state of the art.

Performance was a major reservation I had going into this. So I’m happy to say: SSL/TLS is not only inexpensive, it’s ridiculously cheap. In my case, CPU load increase was on the order of 1%, which is a rounding error. Memory usage might have gone up by a couple KB per connection, and there was no noticeable increase in network overhead. In an end-to-end test, a browser with a cold cache loaded and rendered secure pages in (statistically speaking) the same amount of time. On the server side, it’s visually impossible to tell from perf graphs when SSL was enabled, and this is on a site serving > 6 million hits a month.

A couple of things are needed to reach this transparent level of performance while still maintaining high security, and it is almost entirely configuration-dependent, with very little depending on the application itself.

Session resumption

SSL and TLS support an abbreviated handshake in lieu of a full one when the client has previous session information cached. The full handshake takes two roundtrips (plus one from the TCP handshake); session resumption spares the server an RSA operation for the client key exchange, but the big win is that it also eliminates a roundtrip.

Here is where things start to get weird. There are at least two forms of session resumption. The session identifier method is baked into SSL and is therefore supported by default – enable it on the server and it should Just Work. The one issue is that the server must maintain a session cache. Worse, if you have multiple nodes in the backend you need to either move this session cache into a shared pool or implement some kind of session affinity (urgh). Or perhaps you could do something even stupider and terminate SSL at your load balancer, defeating the purpose of having a load balancer.

Luckily, TLS has an optional extension (described in RFC 5077) that makes the client store the session resumption data, including the master secret that was negotiated and the cipher. This session ticket is further encrypted and readable only by the server to prevent tampering. In short, implementing TLS session tickets as a resumption protocol gives you the best of both worlds – a faster, abbreviated handshake without requiring the server to cache anything. Unfortunately this is an optional extension to TLS, so it is not supported by all clients and servers, and a pool of servers needs to be configured properly so they can share tickets by using a common encryption key (unless you have session affinity, in which case it doesn’t matter).
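As a concrete sketch, here is roughly how both forms of resumption are configured in nginx (the directives are real nginx directives; the cache size, timeout, and key path are placeholder values):

```nginx
# Session IDs: a shared cache lets all worker processes resume sessions.
ssl_session_cache   shared:SSL:10m;    # ~10MB, roughly 40k sessions
ssl_session_timeout 10m;

# Session tickets (RFC 5077): to share tickets across a pool of servers,
# deploy the same key file everywhere and rotate it regularly.
ssl_session_tickets    on;
ssl_session_ticket_key /etc/nginx/ticket.key;
```

Note that the ticket key can decrypt every session it has issued, so it deserves the same protection as your private key.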

Certificate Chains

Most certificates require an intermediate certificate to be presented along with them; there could even be several intermediates. Ensure the following:

1) Send all required certificates to validate the chain. If you don’t, things will probably still work because browsers will tolerate anything short of genocide, but they’ll probably do a DNS/TCP/HTTP dance in the middle of your TLS handshake to some other server to grab the certificate, which is obviously no good.

2) Don’t send any extraneous certificates. Besides the obvious slowdown from sending unnecessary data, you *might* in some cases cause an extra roundtrip if you overflow your TCP window and end up having to wait for an ACK before sending more data. So yeah, fewer packets matter here.
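To make the roundtrip concern concrete, here is a rough sketch of the arithmetic, with made-up but plausible certificate sizes. An older server with a 3-segment initial congestion window can be pushed over the edge by a single extraneous certificate; the IW10 default from RFC 6928 gives much more headroom:

```python
# Will the certificate flight fit in the initial congestion window,
# or will the server stall mid-handshake waiting for an ACK?
MSS = 1460            # typical max segment size, bytes
IW_OLD = 3 * MSS      # legacy initial window, ~4.4KB
IW_NEW = 10 * MSS     # IW10 (RFC 6928), ~14.6KB

# Illustrative sizes, bytes:
leaf_cert = 1500
intermediate = 1200
extraneous = 1100     # a cert the client didn't need
ocsp_staple = 1000

lean = leaf_cert + intermediate + ocsp_staple   # 3700 bytes
bloated = lean + extraneous                     # 4800 bytes

print(lean <= IW_OLD)      # True: one flight, no extra roundtrip
print(bloated <= IW_OLD)   # False: overflows, costs a roundtrip
print(bloated <= IW_NEW)   # True: IW10 buys you headroom
```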

OCSP Stapling

OCSP is the Online Certificate Status Protocol. When a client receives the server’s certificate, it normally connects to the certificate authority to ask if the certificate has been revoked. You can save the client a roundtrip to the CA server if you enable the OCSP stapling extension, in which the server periodically connects to the CA to perform the OCSP check itself, then staples this response to the client during the handshake.

This works because the CA’s response is both time-stamped and signed. Clients can be assured that the OCSP response has therefore not been tampered with, nor can it be used in a primitive replay attack since it has an expiration tied to its validity.

There are some aggravating issues here. One is that OCSP stapling only allows one response to be stapled at a time, which is problematic for certificate chains and will probably end with the client making its own OCSP calls anyway. Another is that OCSP responses can be relatively heavy (like 1KB-ish), which combined with the certificates themselves can overflow the TCP window and cause a roundtrip for ACKs, as mentioned above.
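For reference, stapling in nginx takes only a few directives (these are real nginx directives; the resolver address and file path are placeholders):

```nginx
ssl_stapling        on;
ssl_stapling_verify on;                          # verify the CA's response
ssl_trusted_certificate /etc/nginx/ca-chain.pem; # CAs used for verification
resolver            8.8.8.8;                     # nginx needs DNS to reach the CA
```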

Cipher Suites

This part can get filled with conjecture and hypotheticals pretty quickly, but basically you have two choices to make:

1) Cipher: A stream cipher like RC4 is fast and doesn’t require padding, which saves bytes on the wire. A block cipher like AES needs padding bytes but may be more secure. AES-256 is overkill when paired with a 1024-bit public key (the key exchange is the weaker link), but that’s less of a concern since most keys are 2048-bit now.

2) Key exchange: RSA or DHE+RSA? With ephemeral Diffie-Hellman support you can enable Perfect Forward Secrecy, which prevents previously recorded traffic from being decrypted even if the private key is later compromised. This is really powerful and really secure, but it also breaks debugging tools like Wireshark, since having the private key no longer helps you decrypt the traffic. You’ll also handshake at about half the speed of pure RSA.

The reality is that it’s more likely you’ll get pwned by some buffer overflow or Heartbleed-type bug than have someone factor your key, so keep this in mind when selecting your key exchange algorithm and cipher.
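For what it’s worth, a 2014-era nginx configuration reflecting these trade-offs might look like the sketch below. The suite list is illustrative only; prefer a maintained, vetted list (e.g. Mozilla’s recommendations) over anything hand-rolled:

```nginx
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
# Prefer ECDHE suites for forward secrecy and AES-128-GCM for speed,
# with plain RSA + AES as a fallback for older clients.
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:AES128-SHA:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
```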

Other Considerations

You may need HSTS (HTTP Strict Transport Security) if you have concerns about SSLStrip vulnerabilities. SSLStrip is a man-in-the-middle attack in which the MITM silently rewrites requests to point at non-HTTPS pages and then simply reads the transmitted plaintext data. You can implement “HSTS” at the application level selectively by detecting plain-HTTP requests and forwarding to the HTTPS version of your page. But if you are doing this nonselectively, why write more code to do something slower higher up in the stack? A lack of protection at both the application and protocol level will result in fun things like this though: https://codebutler.com/firesheep/
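A minimal sketch of the application-level approach (the function and its request/response shapes are hypothetical, not from any particular framework): plain-HTTP requests get redirected, and secure responses carry the Strict-Transport-Security header so compliant browsers stop issuing plain-HTTP requests at all:

```python
# Sketch: application-level HTTPS enforcement plus an HSTS header.
# `scheme`, `host`, and `path` stand in for fields of a real request.
def enforce_https(scheme: str, host: str, path: str) -> dict:
    if scheme != "https":
        # One plaintext roundtrip still happens here; after this,
        # HSTS stops the browser from ever asking over plain HTTP.
        return {"status": 301, "location": f"https://{host}{path}"}
    return {
        "status": 200,
        "headers": {"Strict-Transport-Security": "max-age=31536000"},
    }

print(enforce_https("http", "example.com", "/login")["status"])   # 301
print(enforce_https("https", "example.com", "/login")["status"])  # 200
```

The one remaining hole is the very first plain-HTTP request, which is exactly what SSLStrip exploits.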

Application level changes may be required. Some code is more prone to breaking or causing issues when served over HTTPS. Poorly written Javascript for example can break if it isn’t expecting HTTPS, and external services that don’t support HTTPS can be problematic. Specifically, all images/resources need protocol-agnostic links AND need to be served securely as well, otherwise you’ll get an annoying “insecure page” warning in the browser since not every resource was sent securely.

Older browsers like IE 6/7/8 screw everything up: IE 6 ships with TLS disabled by default, and no version of IE on Windows XP supports SNI. In practice, you can either support IE 6 / Windows XP or your connection can be secure.

There’s some moderate, but not insurmountable, IT overhead to enabling SSL/TLS. Apache and Nginx support SSL/TLS either out of the box or via easily enabled modules, so it isn’t hard to set them up and configure them to use OpenSSL. But configuring the options as explained above is paramount. So is keeping OpenSSL and other components up to date.

You also need to remember to renew and swap out expiring certs, as well as securely store and back up private keys, cert signing requests, passwords and certificate authority information as well. This is because in (one|three|five) years, it’ll be time to remove the expired certificate and deploy a new one, so you better A) remember where you got your certificate; B) remember how to log in and buy, request, and obtain a new one; and C) test, deploy, and restart your services using the new certificate.
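The renewal reminder can even be automated. A tiny sketch: parse the certificate’s expiration date (shown here in the text format printed by `openssl x509 -noout -enddate`) and warn when renewal is near. The date and threshold below are made up for illustration:

```python
from datetime import datetime

# Expiration date as printed by `openssl x509 -noout -enddate` (made up).
not_after = "Jun  1 12:00:00 2015 GMT"
expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")

def days_left(now: datetime) -> int:
    return (expires - now).days

today = datetime(2015, 5, 2)       # pretend today's date
print(days_left(today))            # 30
print(days_left(today) < 45)       # True: start the renewal dance
```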


You should enable SSL. It’s not that hard, performance is a non-issue, and you can buy a cert for about $15 a year. Put a date on your calendar to renew it.

Raymond Chen's lessons

A random collection of wisdom from Raymond Chen and The Old New Thing. I plan to keep this updated as I discover/remember more of them.

Windows doesn’t have an expert mode because you are not an expert.
This is just the Dunning-Kruger effect in play: people who are not experts pretty much by definition lack the ability to judge whether they are experts or not. “Expert users” using the advanced features of Windows invariably make feature requests that are equivalent to the beginner feature that already exists.

The hatchway is still secure, even if you opened it with the key.
It’s not a security bug if the user has to first give permission to elevate. Bogus security reports of this nature generally go like this:

  1. Do something that requires elevation, such as replacing an application’s DLL with a malicious copy.
  2. Run the application.

Except, it’s not a security bug because step 1 required elevation, and therefore an administrator’s consent.

Eventually, nothing is special any more.
If you create special functions or flags in your API to give them extra functionality, they will in practice become the defaults over time, as programmers cargo-cult their way through programming. Eventually people find that the regular function “doesn’t work” (for various definitions of “work”), and that the special function does.

Providing compatibility overrides is basically the same as not deprecating a behavior.
“If you provide an administrative override to restore earlier behavior, then you never really removed the earlier behavior. Since installers run with administrator privileges, they can go ahead and flip the setting that is intended to be set only by system administrators.”

Appearing to succeed is a valid form of undefined behavior.
Undefined means anything can happen, including: returning success, nothing, formatting your system drive, playing music, etc. So it is futile to ask “if the documentation says doing x results in undefined behavior, why does it appear to work?” Also, one cannot rely on a specific form of undefined behavior; relying on it implies the behavior is defined and contractual.

The registry is superior to config files.
Config and .ini files are deprecated in favor of the registry because:

  1. ini files do not support Unicode.
  2. Security is not granular (how do you restrict a group from editing a certain part of the file?)
  3. Atomicity issues with multiple threads/processes can lead to data loss on the flat file (the registry is a database).
  4. Denial of service issues – someone could just take an exclusive lock on your config to screw with you.
  5. ini can store strings only, so if you need to store binary you’d have to encode it as a string.
  6. Parsing files is slower, and writing settings would require loading and reparsing the whole file.
  7. Central administration via group policy would be exceedingly difficult compared to a registry.

Computer science: do not confuse the means with the ends.
It is often said that the purpose of garbage collection is to reclaim unused memory, but this is incorrect. The purpose of garbage collection is to simulate infinite memory. Reclamation is just the process by which this is achieved. For example, a null garbage collector is provably correct if you have more physical memory than your program needs. Similarly, allocating a value type on the stack is an implementation detail. It’s not a requirement that it is on the stack, only that it is always passed by value.

Open source isn’t a compatibility panacea.
You don’t get rid of compatibility problems by publishing source code; in fact that makes it easier to introduce compatibility issues because it exposes all the internal undocumented behaviors that aren’t contractual.

You can’t satisfy everyone about where to put advanced settings.
This is a specific case of not being able to delight all the people all the time when the audience is measured in billions. Most people prefer advanced settings in one of six categories (quoting Raymond):

  1. It’s okay if the setting is hidden behind a registry key. I know how to set it myself.
  2. I don’t want to mess with the registry. Put the setting in a configuration file that I pass to the installer.
  3. I don’t want to write a configuration file. The program should have an Advanced button that calls up a dialog which lets the user change the advanced setting.
  4. Every setting must be exposed in the user interface.
  5. Every setting must be exposed in the user interface by default. Don’t make me call up the extended context menu.
  6. The first time the user does X, show users a dialog asking if they want to change the advanced setting.

Each item is approximately an order of magnitude harder than the last, and the final one is objectively user-hostile. Whatever you decide to implement, the other five groups will call you an idiot.

Cleanup must never fail.
Low level cleanup functions don’t have very many options for recovering from failure, so they must always succeed (they may succeed with errors, but that is not the same as failing).

Don’t use a global solution to a local problem.
Since an operating system is a shared playground, you can’t just run around changing global settings because that’s how you like it. If two applications with opposing preferences tried this, one or both of them would break; the correct approach is to change the setting in a local scope to avoid breaking other applications.

A platform must support broken apps; otherwise you’re just punishing the user.
Compatibility with apps, including incorrectly written apps, is crucial for platforms because users expect programs to work between versions of Windows. It is tempting to be a purist and declare that the apps should break, which will force the developers to fix them. In practice, the developers either don’t care, no longer exist, or don’t have the source code any more. Users will instead blame the platform and/or not upgrade.

Users hate it when they can’t cancel.
If you have a long running operation or some multi-step wizard, the user should be able to cancel. It should be clear what will and will not be saved or committed when they cancel.

Geopolitics is serious business.
It can be illegal to have a map with incorrect labels or borders (the correctness of which depends on who is looking), or to call disputed territories (such as Taiwan) countries in some places.

The USB stack is dumb because it’s dealing with dumb manufacturers.
Some USB devices have identical serial numbers which can cause non-deterministic behavior and arbitrary settings assignment, so Windows has no choice but to pretend every device is unique. This is why if you unplug/re-plug a device into a different USB port, Windows treats it like a new device and forgets all your settings. More generally, Windows could be smarter, but then things would break.

Avoid Polling
Polling prevents the hot code and all code leading up to it from being paged out, prevents the CPU from dropping into a lower power state, and wastes CPU.

The Case Against Exceptions

Goto statements went out of style in the 60s, relegated today to be the prototypical example of bad coding. Yet hardly anyone seems to be bothered by exceptions, which do basically the same thing. Used improperly, exceptions behave like goto statements and can be just as bad.

Exceptions essentially allow you to move error handling code out to a dedicated location. They have the added benefit that they can propagate up the stack so you can consolidate error handling into sensible modules. This allows certain subsystems to not care about exceptions because they can be handled by a caller. Gotos, on the other hand, would generally require every function to have its own error handling section, either because the language doesn’t support gotos to a different scope, or because it would be a terrible idea even if it were supported.

As a corollary, without exceptions you end up having to check every method call to ensure it succeeded, whereas exceptions optimize for the common case and make for much cleaner looking code.

Unfortunately, these benefits are mostly illusory, if you aren’t careful. You end up needing to write much more exception handling code to handle the different cases, and it’s incredibly easy to introduce subtle and hard-to-spot bugs. For example, suppose you have the following (loosely based on a snippet by Raymond Chen):

class Package {
    void Install() {
        CopyFiles();
        UpdatePermissions();
        CreateDatabase();
    }
}

try {
    package.Install();
} catch (Exception e) {
    // error handling
}
Notice the subtle bug here: if CreateDatabase throws, the catch block needs to know that CopyFiles() and UpdatePermissions() have already run, but that context was lost the moment those functions returned. To be correct, the catch clause needs to know exactly how the Install method works, what it can throw, and in what order it performs its operations (in this case, a cleanup method needs to know that permissions should be reverted and copied files removed; and depending on how far CreateDatabase got, it may have to clean up database files as well). This introduces tight coupling that isn’t immediately obvious, because in the common case nothing goes wrong and the bug is never exposed. And important information about where the exception originated is lost unless additional work is done to preserve that state.

More generally, exceptions can decrease code visibility, because it’s extremely hard to tell if code is correct by looking at it. Did you forget to catch possible exceptions here, or is it handled further up the stack? Does any given method document all the exceptions it could throw? How do you know without reading the declaration?

These problems are not academic: invariably, many larger projects become basically unmanageable because each subsystem introduces another layer of exceptions that must be handled. This leads to what I call whack-a-mole debugging: run the code in production, and every time an uncaught exception causes a crash or bug, go in and find where you let the exception leak, plug the hole, then repeat forever.

But wait! What about finally statements? Finally can help improve the atomicity of methods by cleaning up, freeing resources, and generally making state consistent again. But they are once again tightly coupled to exactly what the throwing method was doing: what needs to be cleaned up? What is the order of operations of the function that threw? More importantly, will this break if the function changes later down the line? The catch / finally blocks might not even be in the same file or class, which means the code locality is now far enough away that there is a brittle sort-of-contract here that’s probably fuzzily documented, if at all.

But wait! What about checked exceptions like in Java? Doesn’t that solve a problem by explicitly declaring what the caller should expect and handle?

Even assuming they are implemented and used correctly, checked exceptions suffer from a versioning problem. Changing what exceptions can be thrown in a method can cause calling code to break or stop compiling, so checked exceptions are actually part of a method’s signature. Want to add a new throws declaration in a library? You can’t — you have to make a wrapper method to ensure backwards compatibility, assuming other people use your code. So unless you are prepared to do this (or don’t care that other people depend on your code) you must never change the throws declaration of a method.

The alternative without checked exceptions is just as bad: a call to any random library could crash you at any moment because it threw an unchecked exception you weren’t expecting. All told, it makes it a total pain to reuse code because literally any function call is a hidden minefield of invisible gotos: can you guarantee the call won’t throw, and that everything it transitively depends on won’t throw either? No? Then you better wrap it in a massive try statement. Exceptions mean any method anywhere can return at any moment, creating exponentially more return paths for every line of code.

Another subtle problem is that exceptions can force you to architect your code differently, due to disagreements on the meaning of an exception: you might throw exceptions rarely, but a library might be more liberal with them, which can force you to restructure your code to use that library correctly. Because partially-written states are a real threat when programming with exceptions, you may end up having to organize a lot of code around a “commit” phase where exceptions cannot cause problems.

Some Suggestions
1. Set and enforce rules on how and where exceptions can be thrown and where they will be expected and handled. Make a guarantee about the side effects of a function when it throws, and what the catch block will do in terms of cleanup.
1a. Avoid operating under uncertainty: don’t do anything fancy in a global catch-all block. Catch the most precise type of exception and do the least amount of work possible.

2. Create and enforce boundaries to encapsulate subsystems; do not allow exceptions to cross subsystem boundaries. This is the loose coupling pattern applied to exception handling. It will help prevent return paths from exploding exponentially.
2a. You can use a stricter version of this rule by requiring your own code to never throw. This way, you only have to worry about external libraries that might use exceptions, but you can abstract this away from other modules. Google’s own C++ style guide forbids throwing exceptions.
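Rule 2 can be sketched in Java with a hypothetical config subsystem (all names invented for illustration): the boundary method catches everything its internals might throw and translates it into a value the caller can inspect, so no exception type leaks across the boundary.

```java
import java.util.Optional;

// Hypothetical subsystem illustrating rule 2: exceptions stay inside.
class ConfigSubsystem {
    // Internal helper; free to throw within the subsystem.
    private static int parsePort(String raw) {
        return Integer.parseInt(raw); // may throw NumberFormatException
    }

    // Boundary method: never throws. Callers branch on the Optional
    // instead of wrapping every call site in try/catch.
    static Optional<Integer> port(String raw) {
        try {
            return Optional.of(parsePort(raw));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

Callers of `port` get exactly two return paths (a value or an empty result), regardless of how many ways the internals can fail.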

Why are Microsoft Products so Large?

A few months ago I anonymously answered a question on Quora, and it turned out to be my most popular answer ever, by several orders of magnitude. I’ve reposted it here, in order to expand on it a little bit.

Question (paraphrased): Why is Office more than 800MB in size, when LibreOffice can come preloaded with all of Ubuntu on a 750MB CD?

This question seems a little loaded, and is looking for an excuse to accuse Office of bloat. However, it’s important to keep in mind how difficult it is to make a software suite: you have an extremely broad user base, ranging from the proverbial grandma who fires up Word to type up an email, to the banker who uses the most advanced pivot-table-sparkline-sprinkled features of Excel. Here are a couple of major reasons I can think of, in no particular order:

  • Office ships with a huge and growing number of templates, graphics, macros, default add-ons, help documents, etc. This is a major driver of bloat — it has nothing to do with lines of code, and everything to do with a vibrant, comprehensive, and growing ecosystem.
  • Office is decades old. Think about this for a moment. I’ve debugged code that was written in the early 90s. Since Office is pretty well designed and written (contrary to public perception), we almost never throw away old code. So the cumulative effects of years of new features tend to only grow the codebase. Properly leveraged, this is a major competitive advantage.
  • The sheer number of features in Office is mind-boggling. For most releases, Office closes more bugs (not sure if I’m allowed to disclose numbers) than most products have lines of code. Failure to understand how many features Office has is the #1 cause of death to direct competitors.
  • Office installs all code that it needs out of the box, with no external dependencies aside from the Windows API. This might seem counter-intuitive, but it actually makes the suite much larger. This is because we don’t rely on any third party library or framework. This can obscure the real size of installations such as LibreOffice, because it requires Java (and its default library), but nobody counts that against the size of LibreOffice’s installation. The same can be said of .NET applications – .NET itself is a massive codebase.
  • Licensing and code obfuscation play a small role. In addition to having to write licensing and antipiracy code that LibreOffice doesn’t need to implement, this code must be obfuscated and protected against attacks. No easy feat, considering the attacker has local administrator rights. Also, Office is designed to be resistant to failure, and there is significant updating and security support built into the platform. This all adds weight.
  • Running in native code also means there are fewer abstractions; Office has code to deal with weird hardware and software configurations. It accounts for settings that stupid “registry cleaners” tweaked that would otherwise break it. It ships in 40+ languages. It knows how to deal with paths longer than 255 characters, or that contain Unicode. It contains security checks to defend against users opening malicious Excel documents. There are hundreds more examples of things Office does that nobody realizes, but which would be sorely missed if they disappeared. This all takes code.

In general terms, Microsoft products optimize for the long tail of use cases. This means Office has lots and lots of features that are seldom used. The 80/20 rule applies here: 80% of users use only 20% of the features. There is a nuance to this rule, though: every user uses a different 20% of the product. This means a software suite needs exponentially more features to capture a larger and larger share of the market; Office owns the market.

The reasons for Office’s dominance are poorly understood, and often attributed to format lock-in, or to being the existing standard. But Office really wins by fully exploiting its economy of scale, and size is a side effect of this. The massive user base allows Microsoft to invest in features that are relevant to only a small segment of users and still turn a positive ROI. But Microsoft often invests in features even when ROI is negative. Subsidizing unprofitable features means Microsoft can do lots and lots of things that competing software won’t do. This means competitors will have to burn money to catch up to Office, money they won’t have because they don’t have as large a customer base[1]. This essentially guarantees Office’s dominance.

It also explains why Office takes up so much space. It’s not a bug; it’s a feature.

[1] Except Google, which is apparently happy to subsidize from search.

Why do new computers have so much crapware?

tl;dr: because that’s what the market wants.

The commodity PC business is very competitive. The margin on a typical consumer desktop or laptop computer is at break-even or less. Mostly, profits come from selling support or extended warranties and the like.

Imagine you are a PC manufacturer. Due to intense pricing pressure, you are basically losing a couple dollars on every sale. Now imagine some software vendor comes along and offers you $10 per unit to preload Office, or some antivirus, or whatever. Would you rather:

  1. Say no, and continue to lose money on every sale.
  2. Say no, and raise prices to cover costs, but lose all sales to your competitor.
  3. Say yes, and preload the crapware to lower the cost-per-unit?

Hint: only one of these results in you staying in business.

How is any of this the fault of consumers? Because consumers will search online for five hours to save 7 bucks on a 600 dollar laptop. In this kind of environment, you either preload trialware, or your prices aren’t low enough to move any units.

The same kind of nonsense is also happening to the airline industry[1], which is why they are always inventing baggage fees and dreaming up ways to charge customers for using the restroom.


[1] To be fair, there are other causes, one of them being an entire industry designed to work when oil is < 80 dollars/barrel and go bankrupt when it isn’t.

People prefer Bing over Google when the labels are swapped

SurveyMonkey last month released some surprising insight from a study they recently did comparing users’ search preferences.

The result? It turns out people prefer Bing’s results over Google’s, but only if you label them as Google’s. In fact, if you correct for the Google brand, people outright prefer Bing.


Why does this matter? Because it means Google’s search quality is actually inferior to Bing’s. If you look at the preference graphs, this is obvious because Google slightly edges out Bing when the labels are correct, but when the labels are swapped, Bing’s results shoot WAY ahead. However — there are a couple nuances we can’t get into here. For example, it’s not clear if Bing is universally better over the set of all queries, or if they managed the trivial task of optimizing the most common ones. Anecdotally, my experience is that Bing is fairly good at long tail queries, although I have sometimes had to switch to Google for very specific and obscure searches about narrow subjects. Unfortunately, nothing in SurveyMonkey’s blog post gives us any further clues on this.

This should be a major coup for Bing, but it’s not clear what they do with this information: after all, they’re still Bing. Basically, it’s not clear whether the problem is that they’re not Google, or that they’re Microsoft. I suspect it’s probably a mix of both: old habits die hard, and Google is good enough for most people. You’re not going to get an order-of-magnitude improvement in relevance like Google was over the old search engines. And even though search theoretically has very low lock-in, the incentives to switch are actually fairly low: marginally better results in exchange for changing a well-practiced workflow, and admitting that an iconic search engine is no longer the best. Never underestimate the human affection for rationalization.

On top of many people instinctively holding a grudge against Microsoft, Microsoft also has a fairly terrible marketing department. Remember, these are the guys who came up with the name “Windows Phone 7 Series” and insisted that was the official name. Also recall that “Bing” in some Chinese dialects sounds a bit like “disease” or “sickness.” That’s not exactly the kind of connotation you want with the world’s fastest-growing internet population.

It’s interesting to note that SurveyMonkey was not commissioned by Microsoft to do this study; in fact, Google is an investor in SurveyMonkey.


Hey Engineers, Let's Stop Being Assholes?

There’s this toxic idea in tech circles right now that’s starting to get really tiring. And it pains me to have to point this out because I could just blissfully go along with it, and give myself that self-congratulatory pat on the back that most of the tech world is doing on a nearly daily basis.

It’s the elitism. There’s this culture (cultivated by engineers) that worships engineers and shuns everyone else for not ‘being technical.’ This culture is backwards and counterproductive. It presupposes that engineering is the only thing that matters, and that everything else must defer to it.

There’s a reasonable origin for this line of thinking. Back in the dot com days, business majors were raising 50 million to make online wedding invitations and going public because they had a homepage. MBAs were looking for some engineers to “code up this idea quick,” as if the tech part of a tech company was just this checkbox that needed filling. As if engineers were these interchangeable cogs in the machine of a startup. Of course, those tech companies imploded and for most of them, technology wasn’t the primary cause. But even really great business ideas often failed for lack of technical expertise. It turns out people who don’t know how to create software are also terrible at recognizing how important (and hard) it is.

But the reverse is also true. And that’s how far the pendulum has swung in the other direction. There are lots of engineers proudly proclaiming that everything that isn’t engineering is just some checkbox department filled with warm bodies who weren’t good enough to be programmers. As those in the know have known for a long time, it turns out things like business development and “having customers” are pretty important too. It turns out people who don’t specialize in a non-technical role are also terrible at realizing how hard and important it is.

Which brings me to the general observation that everyone thinks their job is obviously the most important and indispensable. Not surprisingly, everyone is wrong, but engineers have convinced themselves that because the MBAs were demonstrably wrong about engineering, engineering must be right. Which is just a logical fallacy wrapped in wishful thinking sprinkled with the chocolate covered bacon bits of all your friends who also happen to be engineers agreeing.

This false premise (engineering is everything) leads to all sorts of crazy conclusions. One of them is that everyone should learn to code. This is stupid, and a waste of time. There’s no substitute for computer literacy, but saying everyone should learn to code is like saying everyone should learn to drive manual transmission and change their own oil: cars are everywhere! Cars are the future! If you don’t drive, you will not be in control of where you are driven! This kind of alarmist propaganda is nonsensical and should be laughed out of the room. The whole point of software engineers is so other people don’t have to code!

This phenomenon coincides with a related one that also annoys me: the insinuation that being an engineer automatically demonstrates your superior intellect. The recent shortage of engineering talent in the US exacerbates this feeling, because it’s easy to conclude that the problem is because people aren’t smart enough to become software engineers. Actually, it’s mostly[1] because 1) most people think programming is about as sexy as mopping floors, and 2) for the past decade, smart people who just wanted to make money could make more for the same hours and less risk — in finance.

Software engineering is actually not that hard. There, I said it. Basic computer science type education and work is not much harder, conceptually, than intermediate calculus. The majority of the population is capable of being taught, and understanding, intermediate calculus. We know this because lots of countries teach both in middle school. QED.

This means that being a software engineer is not beyond the intellectual capacity of the average joe[2]. It also means engineers need to stop waving their diplomas around like they’re computer astronauts[3]. It makes us all look like elitist assholes, and it’s holding back our profession.


[1] Ignores our immigration problem, since it’s better if this isn’t about politics.
[2] Speaking strictly about proficiency, of course. We all know there’s a very high skill ceiling, and being a “great engineer” is a whole other ballgame. But this too has lots of external factors not related to innate skill.
[3] But don’t take this to mean we shouldn’t be proud of what we do.

Understanding Google's Bug Bounty Program

Some people have taken Google’s idea of offering security bug bounties, and taken them to their logical conclusion: why stop at security bugs? Why not incentivize reporting of ALL software bugs with bounties? Aren’t other companies cheap for not offering bug bounties?

Questions along these lines misunderstand how software development works. Engineers don’t sit on their hands and surf Reddit after shipping a product. They’re already working on bugs. All sufficiently complex software ships with known bugs; more reporting isn’t likely to change whether they get fixed or not.

So the premise that “reporting more bugs will improve software quality” is speculative, at best. Software quality is determined by what the market will bear. The market usually rewards buggy-but-good-enough software that solves a problem now, rather than perfect solutions that are late to the party. This is partly because of the time value of software, and partly because chasing defects offers diminishing returns.

But more to the point: everyday users are not equipped to report bugs. They don’t have the training, tools, or motivation to do it properly.

At Microsoft (and other software companies), crash dumps already include as much information as can be legally collected, based on the user’s consent. So bug information for crashes and many other issues (uncaught exceptions, for example) is already being collected in an automated, accurate way. So really, any well-supported software already has built-in reporting for most high-impact bugs.

In addition, you run into other problems with bug bounties:

  • Common bugs will be over-reported, wasting everyone’s time. Customers are already motivated to report these since they want them fixed.
  • Unexpected, but by-design behavior will be incorrectly reported as bugs.
  • Problems will be exacerbated by the bounties, increasing volumes and decreasing quality. This actually creates noise around what is really important to fix (e.g., the defects users would report even if there was no bounty).
  • A bug bounty program isn’t free. Someone has to triage the input and it’s a zero sum game: either a developer does less productive work sorting through bounty submissions or headcount grows. It’s not like you can fire the testing team.

The only exception to this rule is security bugs, which operate a little bit differently than run-of-the-mill defects:

  • There is already a market for security bugs, which can be sold to hackers. The developer is simply trying to outbid them to keep the product secure.
  • This means there’s already a set of professionals who are hunting for such bugs; professionals are much more likely to find bugs on account of understanding how software is designed and implemented.
  • Users are unlikely to notice or report security bugs since they generally don’t obstruct functionality, meaning there are fewer dupes to wade through, and bug reports will be of higher quality on average.

So in my opinion, paying bounties for security bugs can be effective, but it’s unlikely that bounties for functionality bugs would be particularly productive in the general case.

OAuth Hell

It’s a pretty sad fact that OAuth has come to be a de-facto industry standard for API authentication, because OAuth is so broken.

Before OAuth, creating and consuming APIs across services was hell. We mostly just did stupid stuff like asking users for their passwords, so we could log in on their behalf and maybe do some page scraping. If a proper API actually existed, it probably implemented a custom authentication protocol that required you to read its exact implementation, permissions, and handshake procedure, making the interface un-reusable.

Ideally, OAuth would come along and solve these problems for us by:

  • Allowing untrusted applications to perform actions on behalf of a user at the API provider.
  • Authenticating the user’s permission to perform said actions, without divulging the user’s password.
  • Selectively granting permissions to an untrusted client, to prevent hijacking of account & login details.
  • Revoking the client application’s privileges at command of the user, without requiring a password change.
  • Promoting code reuse through a standard protocol for negotiating access to an API provider.

OAuth takes every single one of these requirements…and partially solves all of them.

While OAuth is conceptually great, and is much clearer in the 2.0 spec, it still contains a number of warts that make it a complete pain to integrate. Consider this:

  • There’s basically no standard for how to implement an OAuth provider. Try pointing an OAuth client at a different provider and try to count the number of changes you have to make to get it working. It’s mind-boggling the number of unique tweaks and quirks that providers come up with. The whole concept of interoperability is thrown out the window, and you have to go back to “well, it generally works this way, but you have to read all their developer docs and spend an afternoon conforming to the custom design.” And this is all by design! Read it straight from the spec:

    …this specification is likely to produce a wide range of non-interoperable implementations…
    …with the clear expectation that future work will define prescriptive profiles and extensions necessary to achieve full web-scale interoperability.

  • Scopes are pretty much a crapshoot. Take a look at this passage from the spec:

    The value of the scope parameter is expressed as a list of space-delimited, case-sensitive strings. The strings are defined by the authorization server.

    So…how do you find out what scopes are supported, allowed, or required? Surprise! You don’t. You have to read the developer docs. Assuming they are posted. More generally, it’s impossible to programmatically register a client, learn about server capabilities, discover endpoints, and most other things. Which means hours of slogging through documentation to manually code these into the OAuth client, all so you can do it again for the next provider. Did Amazon’s success in services teach us nothing about the value of being able to programmatically discover, query, register with, and use a service?

  • OAuth web flow requires you to visit the provider website through your browser. This makes sense, of course, since you need to authenticate with them before you can authorize the client app. But this flow doesn’t work on mobile, in which case you need to use a different flow, one that requires you to enter your password into the untrusted app. Ugh. To date, there hasn’t been a really good mobile story here from anyone, and we’re still in the dark ages as far as mobile apps are concerned. Which is a shame, because back in my day, we used these things called web browsers.
  • There are security issues you can drive a truck through. Consider this cogent explanation by John Bradley of how granting access to an OAuth client application also gives it the ability to impersonate you at *any other* OAuth client for that provider:

    The problem is that in the authentication case, websites do have a motivation to inappropriately reuse the access token.  The token is no longer just for accessing the protected resource, it now carries with it the implicit notion that the possessor is the resource owner.

    So we wind up in the situation where any site the user logs into with their Facebook account can impersonate that user at any other site that accepts Facebook logins through the Client-side Flow (the default for Facebook). The less common Server-side Flow is more secure, but more complicated.

    The attack really is quite trivial: once you have an access token for the given user, you can cut and paste it into an authorization response in the browser.
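The scope complaint above can be sketched in Java. The helper below (a hypothetical `AuthUrl` class with an invented example endpoint) builds a perfectly spec-compliant authorization URL, yet the actual scope strings still have to be copied by hand from each provider’s developer docs, because the spec leaves them entirely up to the authorization server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds an OAuth 2.0 authorization-request URL.
// The structure is standardized; the scope values are not.
class AuthUrl {
    static String build(String authEndpoint, String clientId, String scopes) {
        return authEndpoint
            + "?response_type=code"
            + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
            // Per the spec, scopes are a space-delimited, case-sensitive
            // list whose meaning is defined by each provider.
            + "&scope=" + URLEncoder.encode(scopes, StandardCharsets.UTF_8);
    }
}
// For one provider the scope string might be "email profile"; for another,
// "user.read"; for a third, a full URL. There is no way to discover this
// programmatically -- you read the docs.
```

The code is identical across providers; the magic strings fed into it are not, which is exactly the interoperability gap the spec admits to.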

OAuth is an ambitious project that has given us a glimpse at how awesome an interactive web can be. It’s just a shame that this is what we’ll have to settle for given the slow pace of improvement in such a widely used authorization framework.