Built to Fail: How companies like Google, IDEO, and 37signals build failure-tolerant systems for anything!
Planning for success, not failure
High-achieving people with a long history of success often plan accordingly: they plan for success in whatever they do. And when you take a successful person and put them in a successful big company that’s already making money from its products, there’s even more reason to plan for high-achievement outcomes.
But let’s say that you take these successful people and put them in environments of great uncertainty, like a Silicon Valley startup – what happens? That’s when realities collide! When you apply the big successful company playbook to startups, you can end up with monolithic planning processes, products that can’t find their markets, and lots of money being spent on launches for the wrong products. It’s not that these tactics are stupid, it’s just that they don’t work as well when you’re dealing with ill-defined customer problems with unknown solutions.
At the heart of this conversation is a question: what happens when you take something that’s usually assumed to be successful, and you instead say that it’s very likely to fail?
In a way, you can think of this as planning to fail, but then building the support structure around the failure in order to create a failure-tolerant system. Let’s dive into this.
Planning for failure, not success
The title of this post refers to the fact that companies like Google, IDEO, and 37signals all have a “Failure is OK” culture built into them.
At Google:
- Google makes money by being always available, ubiquitous, and having a great product
- To deliver their service, they have hundreds of thousands of servers (maybe more)
- Any one of these servers has a high likelihood of failing at any time
- To create a fault-tolerant system, they have lots of redundancy and lots of sophistication around what happens when an individual box fails
- Contrast this to a big-iron approach that builds all the redundancy into specialized hardware that’s designed to never fail
At IDEO:
- Companies hire IDEO to give them fresh designs based on a customer-focused approach
- Part of every project involves lots of brainstorming and coming up with ideas
- However, any specific idea is likely to be bad (for example, only 12 out of 4,000 toy ideas were actually successful, a 0.3% hit rate)
- Thus, IDEO combines structured brainstorming, rapid prototyping, and field research to rapidly try out new concepts and get to good products
- Contrast this to a process where the “Great Man” designer thinks about a design problem and then comes up with the right solution spontaneously
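To make the IDEO numbers concrete, here is a rough sketch in Python of why generating lots of ideas matters. The 12-out-of-4,000 hit rate comes from the example above. Everything else is an assumption for illustration: the function name is made up, and the math treats every idea as an independent draw with the same odds, which real brainstorming certainly isn't.

```python
# Hit rate taken from the example above: 12 successful toy ideas out of 4,000.
HIT_RATE = 12 / 4000  # 0.003, i.e. 0.3%


def chance_of_at_least_one_hit(num_ideas: int, hit_rate: float = HIT_RATE) -> float:
    """Probability that at least one of `num_ideas` ideas succeeds.

    Assumption (not from the post): ideas succeed independently,
    each with the same probability `hit_rate`.
    """
    return 1 - (1 - hit_rate) ** num_ideas


# Show how the odds improve as the number of ideas grows.
for n in (1, 10, 100, 1000):
    chance = chance_of_at_least_one_hit(n)
    print(f"{n:>5} ideas -> {chance:.1%} chance of at least one hit")
```

Under these assumptions, a single idea is almost certainly a dud, while a thousand ideas make at least one winner very likely. That's the argument for volume plus fast prototyping over waiting for the one perfect design.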
At 37signals, in particular Ruby on Rails:
- Rails is a framework built for programmers to build websites
- Of course, every web project requires lots of lines of code which can easily break at any moment
- If you assume that programmers will more often write code that is buggy and breaks, then you’ll want to make testing and iteration easy – this is at the heart of Agile, TDD, continuous integration, and other related disciplines
- Contrast this to a waterfall engineering approach which assumes the correct design and architecture can be thought out by experienced software engineers
Each of these examples is unique in its own way, but the same themes run through all of them.
Characteristics of failure-tolerant systems
Each one of these systems takes the central part of a process and assumes failure, and then builds up a support system around it.
This happens by building on a few core principles:
- Acceptance of failure: You have to accept that shit happens and failure is commonplace – this needs to be internalized so that failure isn’t punished, but rather embraced!
- Massive redundancy: Then, it needs to be easy to have lots of redundancy built into the system – for designers, that means lots of designs get generated. For startups, that means lots of ideas are tested, and for Google, that means lots of servers are used
- Cheap, easy, fast: As a side-effect of the redundancy, it needs to be easy, cheap, and fast to have lots of ideas, lots of servers, or write lots of code. The harder it is, the harder it will be to create redundancy
- Iterative, reality-based testing: Testing these individual components constantly becomes key – you need to force failure on the system to figure out how it reacts at a system-wide level
Building up processes based on the ideas above makes it easier and easier to deal with failure and come out on the other side!
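The redundancy and forced-failure principles can be sketched as a toy simulation in Python. This is only an illustration, not how Google or anyone else actually does it: the function names, the 10% failure probability, and the assumption that replicas fail independently are all made up for the example.

```python
import random


def serve_request(replicas_up: list[bool]) -> bool:
    """A request succeeds if any replica is still up (simple failover)."""
    return any(replicas_up)


def simulate_availability(num_replicas: int, failure_prob: float,
                          trials: int = 20_000, seed: int = 0) -> float:
    """Force random failures on each replica and measure how often
    the system as a whole can still serve a request.

    Assumption: each replica fails independently with `failure_prob`.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    successes = 0
    for _ in range(trials):
        # A replica is up when its random draw lands at or above failure_prob.
        replicas_up = [rng.random() >= failure_prob for _ in range(num_replicas)]
        if serve_request(replicas_up):
            successes += 1
    return successes / trials


# Compare system-wide availability as cheap, unreliable replicas are added.
for n in (1, 2, 3, 5):
    availability = simulate_availability(n, failure_prob=0.1)
    print(f"{n} replica(s): {availability:.4f} available")
```

The point of the sketch is that the individual box never gets more reliable. Only the system around it does, and you find that out by deliberately breaking the parts and watching what happens at the system level.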
Conclusion and next ideas
There are lots of interesting directions that this line of thinking can go.
This area of thinking started out with the hiring process, and the idea that maybe interviews don’t work at all – there’s a bunch of academic research that implies that, actually. So how would you build a failure-tolerant system around the hiring process, if you assume that being a good interview candidate has no correlation with being a successful employee?
For dating, what happens if you assume that people you like to date may not be the kind of person you’d have a successful marriage with? What if people suck at figuring out what kind of guy or gal is the “type you’d bring home to Mom?” I think anyone could attest to the idea that many people suck at figuring out the right person to date, much less the right kind of person to marry. I personally find it crazy that people make a 50+ year decision to be married based on an 18-month sample size :-)
For careers, what if it turns out that people are really bad at figuring out what they’ll actually want to do 40 hours a week, 50 weeks a year, for the rest of their lives? How would you figure out the right career sooner rather than later?
All of these are great thought experiments, I think.
What else am I missing? :-) I’d love to take any suggestions and write up some thought experiments around it.