Hang out around techie circles enough and you’ll eventually hear a variation of the astonished “X has how many engineers?”, where X is virtually any notable startup or technology company.
For example, here are some undated (but relatively recent) ballpark headcounts for the product and engineering divisions of a few randomly selected companies:
- Instagram: 1,800 engineers
- Dropbox: 600 engineers
- Twitter: 2,400 engineers
- Netflix: 2,000 engineers
It certainly doesn’t seem like it would take that many engineers to build what each business is superficially best known for. For example, new Ruby on Rails users often build a Twitter clone as something of a “hello world” app in under a week, sometimes in a single day. Doesn’t that imply Instagram at 1,800 engineers or Twitter at 2,400 is overstaffed?
This “hobby perspective” line of thinking assumes that systems are linear and don’t interact in complex ways (which is basically how hobby software behaves). When working on hobby or academic projects, it’s easy to spend large chunks of the time writing “user-facing” features. And as relatively young and short-lived projects, they often don’t live long enough for technical debt to matter. Mentally extrapolating effort expended to “real world” production systems is therefore a waste of time, and all our intuitions are going to be wrong (as they usually are when the systems involved are non-linear).
On the other hand, real businesses that run production systems are like icebergs – you never see 90% of the work, because the bulk of the engineering effort happens behind the scenes, supporting the business and the folks who run it. So what does it take to bring a hobby application into the real world?
Safe deployments and rollbacks
Sure, you can click the “roll back” button on a little Heroku hobby app, but real systems have databases and side effects that often can’t be fully undone after a bad deploy. Ensuring uninterrupted communication with internal and external services during deployments (via simultaneous deploys, careful API versioning, or pre-negotiated data structures) is hard, continuous work. But putting thought and engineering resources into a deployment process that is safe and optimized for the business is a major competitive advantage (this is not a controversial or particularly original opinion).
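To make the API-versioning point concrete, here’s a minimal sketch of the “tolerant reader” idea that keeps old and new code talking during a rolling deploy. The event and field names are entirely made up for illustration:

```python
def parse_signup_event(payload: dict) -> dict:
    """Accept both the old (v1) and new (v2) shape of a hypothetical event.

    Suppose v2 renamed "username" to "user"; during a rolling deploy,
    both shapes arrive at the same consumer, so we read either.
    """
    user = payload.get("user") or payload.get("username")
    if user is None:
        raise ValueError("payload has no user identifier")
    # Default missing fields instead of crashing on older payloads.
    return {"user": user, "plan": payload.get("plan", "free")}
```

The same tolerance has to exist on every service boundary the deploy touches, which is where the “continuous work” comes from.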
Customer support tooling
Large amounts of effort are expended building integrations and features that help support and customer service teams efficiently do things like reset passwords, merge/unmerge accounts, deal with hijacked logins, etc.
Back-office admin system
Aside from support, many functions want access to data and records stored in the system – sales, marketing, management…all for different reasons, with varying degrees of access, and with different levels of auditing required. And no modern service is an island – dozens, if not hundreds, of external integrations is par for the course for production apps. Most obviously, marketing and sales need to push and pull from their respective systems, and engineering and ops will want visibility into the performance of that code, leading to more code and overhead, ad nauseam.
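The “varying degrees of access” part alone implies real code. Here’s a toy sketch of role-based permission checks for a back-office tool; the role and permission names are invented for the example:

```python
# Hypothetical role -> permission mapping for a back-office admin system.
ROLE_PERMISSIONS = {
    "support": {"read_account", "reset_password"},
    "sales":   {"read_account"},
    "admin":   {"read_account", "reset_password", "merge_accounts"},
}

def can(role: str, permission: str) -> bool:
    """Check whether a role grants a permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Even this toy version raises real questions – who maintains the mapping, how changes are audited, what happens when someone changes roles – and each answer is more code.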
Redundancy and high availability
High availability is incredibly important to any production service, and sooner or later, abstractions will leak, so code and effort have to be spent to deal with failover and reduced redundancy conditions. And of course, larger teams will eventually need staff specializing in the “ops” side of engineering.
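Failover logic sounds simple but has to be written, tested, and maintained. A minimal sketch of the shape it takes, with `run_query` standing in for a real database call:

```python
def query_with_failover(hosts, run_query):
    """Try each host in order; return the first successful result.

    `run_query(host)` is a stand-in for a real database/service call
    that raises ConnectionError when a host is unreachable.
    """
    last_error = None
    for host in hosts:
        try:
            return run_query(host)
        except ConnectionError as exc:
            last_error = exc  # degraded mode: note the failure, try the next replica
    raise RuntimeError("all replicas failed") from last_error
```

Real systems add timeouts, health checks, replica lag awareness, and alerting on the degraded path – which is exactly where the ops specialists come in.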
Backups and disaster recovery
Backup and restore systems need to be built and regularly tested, consuming staff headcount and resources.
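“Regularly tested” is the key phrase: a backup you have never restored is not really a backup. A toy sketch of the backup–restore–verify loop, using a plain dict as the “database”:

```python
import hashlib
import json

def backup(db: dict) -> bytes:
    """Serialize the 'database' (a plain dict here) deterministically."""
    return json.dumps(db, sort_keys=True).encode()

def restore(blob: bytes) -> dict:
    return json.loads(blob)

def restore_is_verified(db: dict, blob: bytes) -> bool:
    """Restore the backup and compare checksums against the source."""
    source_digest = hashlib.sha256(backup(db)).hexdigest()
    restored_digest = hashlib.sha256(backup(restore(blob))).hexdigest()
    return restored_digest == source_digest
```

Scale this up to a real database and the verification step becomes a scheduled job with its own infrastructure, monitoring, and on-call rotation.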
Caching
Any non-trivial service receiving production traffic probably needs to deal with caching, which introduces more infrastructure, code, testing, and overhead. Cache invalidation in particular requires careful detection and handling of edge cases.
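A minimal read-through cache makes the invalidation problem visible. The sketch below deliberately shows the failure mode: a write to the backing store that forgets to invalidate serves stale data.

```python
class ReadThroughCache:
    """Minimal read-through cache; `loader` stands in for a database read."""

    def __init__(self, loader):
        self.loader = loader
        self.store = {}

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.loader(key)  # cache miss: hit the backing store
        return self.store[key]

    def invalidate(self, key):
        # The hard part in real systems is calling this on *every* write path.
        self.store.pop(key, None)
```

In production you’d add TTLs, size bounds, and distribution across machines – each of which multiplies the edge cases.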
Logging / telemetry / performance instrumentation
Engineers build production systems to be instrumented so performance issues and defects can be investigated and resolved. And the data needed for performance instrumentation differs from the data needed for effective debugging. This introduces yet more code, infrastructure, and overhead.
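As a sketch of what “instrumented” means in practice, here’s a timing decorator that emits to an in-memory list standing in for a real metrics pipeline (StatsD, Prometheus, etc.):

```python
import functools
import time

METRICS = []  # stand-in for a real metrics pipeline

def timed(fn):
    """Record the wall-clock duration of each call alongside the result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record even when the call raises - failures need timing data too.
            METRICS.append((fn.__name__, time.perf_counter() - start))
    return wrapper
```

Note that this captures performance data only; the structured debug logging the paragraph mentions is a separate stream with different retention, volume, and privacy constraints.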
Monitoring and alerting infrastructure
Logging and performance instrumentation alone won’t alert you when your systems crash (they’ll just stop sending new data), so again we need to introduce more code, infrastructure, and overhead to build or integrate a monitoring tool, set up and maintain the alerting infrastructure, make sure it’s properly integrated with everything, and test it regularly.
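The core trick is that silence, not an error message, is the signal. A sketch of a heartbeat (“dead man’s switch”) check, with hypothetical service names:

```python
def stale_services(last_heartbeat: dict, now: float, timeout: float) -> list:
    """Return services whose last heartbeat is older than `timeout` seconds.

    `last_heartbeat` maps service name -> timestamp of the last check-in;
    a service that stops reporting shows up here even though it logged nothing.
    """
    return sorted(name for name, ts in last_heartbeat.items()
                  if now - ts > timeout)
```

Around this tiny core sits the real work: the heartbeat transport, the paging integration, escalation policies, and the periodic fire drills that prove the alerts still fire.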
Auditing / history / access control
Production systems need access control, auditing, and history logging to protect users, investigate reported bugs, detect and deal with abuse, and safeguard privacy.
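A sketch of the audit-trail half of this: an append-only log of who did what to which record, and when. Names and fields are illustrative only:

```python
import json
import time

class AuditLog:
    """Append-only audit trail: who did what to which record, and when."""

    def __init__(self):
        self._entries = []

    def record(self, actor, action, target, ts=None):
        entry = {"actor": actor, "action": action, "target": target,
                 "ts": ts if ts is not None else time.time()}
        # Serialized on write: entries are never edited in place.
        self._entries.append(json.dumps(entry))

    def history(self, target):
        entries = [json.loads(e) for e in self._entries]
        return [e for e in entries if e["target"] == target]
```

The production version needs durable storage, tamper resistance, retention policies, and its own access controls – auditing the auditors is part of the iceberg too.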
Billing, discounts, refunds, dunning, invoices, prorating…
Having built or been closely involved in billing systems several times, I could go on for quite some time – billing is a large and still surprisingly difficult system to get right, despite its obvious importance to production services. Even in the age of Stripe and friends, there is a huge amount of irreducible complexity in anything involving the exchange of money – all kinds of code (or ugly manual hacks) emerge to deal with coupon codes, monthly/annual payments, refunds, prorating between plans, metered usage, disputes, international payments; the list goes on.
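Even the single line item “prorating between plans” hides work. Here’s a deliberately naive sketch – it ignores timezones, taxes, partial days, and refund policy, all of which real billing code must handle:

```python
from decimal import Decimal

def prorated_credit(monthly_price: Decimal,
                    days_in_month: int,
                    days_used: int) -> Decimal:
    """Credit for the unused portion of a billing period.

    Uses Decimal rather than float: billing code that rounds
    binary floats will eventually be off by a cent, and customers notice.
    """
    unused_fraction = Decimal(days_in_month - days_used) / Decimal(days_in_month)
    return (monthly_price * unused_fraction).quantize(Decimal("0.01"))
```

Multiply this by coupon stacking, mid-cycle upgrades, and metered usage, and it’s clear why billing keeps whole teams busy.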
Documentation
Internal and external documentation must be written and maintained for every audience that touches the system, from internal engineering and customer service to end users and external developers. It has to be continuously updated and, when the updates are disruptive enough, managed through change management. Versioning becomes important so that documentation updates stay synchronized with code releases; things start to look more like 3D chess when managing multiple product lines, and don’t forget that product and marketing are running a bunch of A/B tests with six different UI variations at any given time. Redirects have to be written and introduced so that the docs stay internally consistent (e.g., pages don’t randomly link to 404s).
Since code originates in the product and engineering department and is maintained there, good documentation has to originate from product and engineering as well.
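The redirect problem alone is code that someone has to own. A toy sketch of following an explicit redirect map so moved pages don’t become 404s (paths are hypothetical):

```python
from typing import Optional

# Hypothetical map of old doc paths to their new homes.
REDIRECTS = {"/docs/v1/auth": "/docs/v2/authentication"}

def resolve(path: str, live_pages: set, max_hops: int = 5) -> Optional[str]:
    """Follow redirects until a live page is reached; None means a broken link."""
    for _ in range(max_hops):
        if path in live_pages:
            return path
        if path not in REDIRECTS:
            return None  # dead end: this link would 404
        path = REDIRECTS[path]
    return None  # redirect loop or too many hops
```

Run something like this over every internal link in CI and you have a link checker – one more piece of infrastructure that exists purely to keep the docs honest.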
Anti-fraud, anti-abuse, security / anomaly detection
Most production services eventually need to have some sort of answer for abusive users, pentesters, script kiddies, spammers, and trolls. Most hobby apps don’t need to deal with this at all. It is shocking how far people will go to probe for issues in a production system.
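One of the first lines of defense is rate limiting. A sketch of the classic token-bucket limiter – crude, but representative of the kind of code every production service ends up writing or integrating:

```python
class TokenBucket:
    """Token-bucket rate limiter: a basic anti-abuse building block."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Spend one token if available; time is passed in for testability."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Production versions live in shared storage (so limits hold across servers), distinguish users from IPs from API keys, and feed the anomaly-detection systems the heading mentions.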
And this list isn’t exhaustive – there’s also:
- the overhead of managing and coordinating teams
- test infrastructure
In short, writing a “Twitter clone” is easy. But that’s because writing the features is the easiest part of the business.
I once heard an elegant definition of a production service: it’s one where someone is being charged money. With very few exceptions, it seems pretty accurate.