In nuclear weapons control, there is a two-man rule that prevents any single individual from accidentally — or maliciously — launching nuclear weapons. Each step requires knowledge and consent from two individuals to proceed. Even when the President initiates a launch order, he must jointly authenticate with the Secretary of Defense (they’re given separate codes, even though the President has sole launch authority).
When the order reaches the launch control center, two people are required to authenticate and initiate the launch, for example by (vastly simplifying…) turning two keys simultaneously.
The benefits are at least twofold. First, it’s much harder to compromise or impersonate two people simultaneously than it is to compromise one. Second, it also provides error correction. When two people are involved in a process, it’s much more likely that if someone is about to make an oversight or error, it will be caught. This works better when the roles are asymmetric, because then they won’t both be on the same “wavelength.” Most good processes of this type seem to be asymmetric in some way.
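The two-key idea can be sketched in a few lines of code. This is purely illustrative — the roles and codes below are invented — but it shows the essential property: one valid credential on its own accomplishes nothing.

```python
import hmac

# Hypothetical two-person gate: the action proceeds only when BOTH
# parties independently present a valid code. Roles and codes here
# are invented for illustration.
AUTHORIZED = {
    "commander": "alpha-1234",
    "deputy": "bravo-5678",
}

def authorize_launch(codes: dict) -> bool:
    """Return True only if every required party supplies a valid code."""
    if set(codes) != set(AUTHORIZED):
        return False  # a party is missing (or an extra one showed up)
    return all(
        hmac.compare_digest(codes[role], AUTHORIZED[role])
        for role in AUTHORIZED
    )

# One valid code alone is not enough:
assert not authorize_launch({"commander": "alpha-1234"})
# Both valid codes, presented together, unlock the action:
assert authorize_launch({"commander": "alpha-1234", "deputy": "bravo-5678"})
```

Note the asymmetry baked in: each role has its own code, so compromising one person — or one credential store — still leaves the attacker one key short.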
There are many contexts where we want error correction and extra security: executing large financial transfers, preparing patients for surgery, performing space shuttle launch checks, or running nuclear reactors. It also comes up a lot in software development, which is what got me thinking about this. Let’s count the ways we implement the two-man rule:
Code review: Everyone is either doing this or making bad excuses for why they aren’t. But it’s the clearest and most accessible example of a two-man rule in software engineering.
Spec review: An essential part of any sizable project is a review of the specification to make sure, in particular, that 1) the right thing is being built in the right way, and 2) the right people and teams are aware of any impact the work might have on them.
Continuous integration: The branch built on your machine, but does it build on another one? This turns up countless “oh right I added this config variable/package and forgot to propagate the change” incidents before they become blocking.
Pair programming: I think of this as just real-time code review. It has all the same benefits and more, with the downside that it can’t be done asynchronously.
Deployments: I wish we did this closer to 100% of the time, but it has definitely been helpful to have a second person on hand for deployments in addition to the primary engineer. This is especially critical during complex deployments that happen in phases or involve many moving parts. Ideally the second person’s role is limited to going through the checklist one last time (“says there are database migrations, are we expecting downtime or can we keep pre-boot on, and if so is the config correct?”), and in the event of an issue, helping to investigate or doing the checklist in reverse to roll back.
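The secondary engineer’s job amounts to walking the checklist and surfacing anything unresolved before the primary pushes the button. A minimal sketch (the items and answers are invented examples, not our actual checklist):

```python
# Hypothetical deploy checklist: the second engineer confirms each
# item, and the deploy waits on anything left unconfirmed.
CHECKLIST = [
    "database migrations reviewed",
    "downtime window agreed (or pre-boot kept on)",
    "rollback plan confirmed",
]

def unresolved_items(answers: dict) -> list:
    """Return checklist items the second engineer has not confirmed."""
    return [item for item in CHECKLIST if not answers.get(item, False)]

blockers = unresolved_items({
    "database migrations reviewed": True,
    "downtime window agreed (or pre-boot kept on)": True,
})
# "rollback plan confirmed" is still open, so the deploy waits.
assert blockers == ["rollback plan confirmed"]
```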
Mind the Gap
As we continue to grow, there are a few areas where I think a more consistent two-man rule will lead to high return on effort in the future:
- manually rebooting servers, changing server counts or container types
- adding/scaling services
- running one-off commands against the production database
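One lightweight way to enforce the rule for operations like these is a guard that refuses to run unless two distinct engineers have signed off. A sketch, with invented names and a stand-in command:

```python
# Hypothetical guard for one-off production operations: the command
# runs only when two *different* engineers have approved it.
def run_with_two_approvals(command, approvers):
    """Execute `command` (a zero-argument callable) only if at least
    two distinct approvers have signed off."""
    if len(set(approvers)) < 2:
        raise PermissionError(
            "need sign-off from two distinct engineers, got: "
            + ", ".join(sorted(set(approvers)))
        )
    return command()

# Two distinct approvers: the operation proceeds.
result = run_with_two_approvals(lambda: "rebooted web-3", ["alice", "bob"])
# One approver (or the same person twice) is rejected before anything runs.
```

Deduplicating the approver list matters: without `set()`, one engineer approving twice would defeat the whole point.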
And yes, every once in a blue moon we deploy tiny changes to production without full code review, or force a failing build onto staging — something that is intentionally difficult and unwieldy to do. This has gone from rare to extremely rare, and I expect the trend to continue. But I like processes to be developed and enforced bottom-up where possible, and prefer values over inflexible rules. So far this tenet hasn’t failed us, and we still trust each other’s good judgment above all else.
However, as the stakes get higher every day, the cost/benefit equation will eventually tip towards a standard operating procedure that can be summarized as “trust, but verify.” If that doesn’t sound like a good proverb to live by, maybe a second opinion is in order?