Let It Fail

Some time ago I found myself leading half of engineering at a young startup. My group had been formed around a philosophy of platform value and as such had taken on a large project to migrate our application services to a new architecture. In parallel the business was evolving too and many new features were already being planned, each deeply integrated into the legacy system we were frantically trying to move on from. Quickly I realized the business and engineering teams were on a collision course.

During my next one-on-one with my manager I raised my concerns, "We're mired in technical debt and dealing with outages nearly every day," I said, effusing the frustrations of the team. These symptoms were well-known and we were sure that the root cause would be addressed by our platform work. There was only one catch: not only was this platform work not delivering incremental value, it was creating more cognitive overhead as effectively two systems existed in parallel until it shipped. With the business asking for more engineering bandwidth, we were in danger of being in limbo indefinitely.

"We must stay focused on the platform work," I declared with confidence. My manager, nodding, agreed unequivocally, "You're absolutely right, Max." Ready to transition to pitching my plan to keep us on course, he surprised me by continuing, "And we're going to let it fail." I was sure I had misheard but he continued, "It will fail too. Exactly in the way you've predicted." Realizing I had not misheard, I paused for a moment to recover. Now thoroughly confused, I could only ask, "But shouldn't we intervene?" My manager smiled, "Oh you certainly could and I know you would address the acute problem but it wouldn't do anything about the chronic ailment."

As he explained I was beginning to understand.

Letting Things Go Sideways

What I had failed to see as I assessed the situation was that while things would certainly break, our work would be delayed, the team would endure a longer period of maintaining both a new and old system, the fallout would be relatively limited and given our scale and working process, quickly corrected. All this meant that the cost of allowing things to go sideways for a bit was relatively low.¹ Moreover, it represented an important learning opportunity for the broader business which would generate broader buy in and allow us to dramatically improve process.

In fact articulating the perils of foregoing the platform work and building on technical debt was crucial: while both my manager and I had succeeded in selling the idea in theory, in practice the business was still struggling to map this to the day-to-day business needs. The fact the work had little to no incremental value made this situation more challenging. Perhaps failure could be a helpful illustration.

Back Pressure

In networking and distributed systems, there's a concept known as back pressure. This idea revolves around the notion that different connected points within a system can prevent themselves from becoming overwhelmed and failing completely by controlling the amount of inbound data they accept. Essentially, back pressure enables a kind of controlled failure, and software systems can be designed around this principle to achieve greater scalability and resilience.

Similarly, organizational processes can be thought of as systems that can also benefit from feedback loops based on the principles of back pressure. By implementing such feedback loops, organizations can limit the amount of work they take on, and thus reduce the risk of becoming overwhelmed and failing. This can help improve overall process resilience and scalability, resulting in a more efficient and effective workflow.

Although it's often possible to iteratively optimize software systems, this alone may not result in a more resilient system.² Much like a situation where an individual intervenes and prevents a minor catastrophe from occurring–while this may address the immediate issue, it doesn't necessarily prepare us for similar future disasters. Instead, we can enhance our processes by incorporating appropriate back pressure, allowing upstream parts of the system to become more aware of potential issues and adapt accordingly. By doing so, we can increase our overall preparedness and improve the system's resiliency, rather than simply addressing acute problems as they arise.

Heroism

Interventions suffer from another problem as well: they often rely on the heroic efforts of an individual who identifies a problem or opportunity and decides to take action. While it may be difficult to see the potential drawbacks of this approach in the short term, a longer time horizon reveals that heroism is often an anti-pattern.

Returning to the notion of systems, we can draw on the idea of single points of failure. When a system relies on one node in a graph and when that node disappears or becomes degraded in some way, the entire system is necessarily impacted.³ Similarly, our human systems can form single points of failure as they become dependent on individuals. Consider that if our hero disappears on a months-long backpacking adventure in Europe and without her to save the day our velocity slows to a crawl, we've stumbled upon a systemic reliance that reveals deeper fragility.

As such generally we want to discourage a culture of heroism and seek to build systems and processes that don't require it.

Enjoying the article?

Subscribe for free to receive new articles when I publish them.

How I Learned To Stop Worrying

Things did indeed break. But as they did something else happened: our product team began to see the legacy system would not support the business goals and they went from somewhat passive admirers of the theory to active evangelists of the platform work. As the legacy system buckled under new demands, the conversation quickly evolved from, "How do I prioritize this new feature?" to "How do we create space for holistic system work such that we can build better features?" When it became evident the platform work supported the net-new work, the product and engineering teams led prioritization together.

I must admit, it wasn't easy for me to resist the urge to intervene. I have a natural inclination to jump in, get my hands dirty, and help. However, by creating room for some controlled failure, we gained a broader appreciation for the limitations of the legacy system outside engineering, as well as the need for a stronger technical foundations to support our business. The organic back pressure created by the legacy system falling over was enough to steer us back on course, and at a relatively low cost. Moreover, it provided us with the opportunity to focus on developing resilient process that didn't require individual heroics to avert disasters.

Ultimately we completed the platform work with the enthusiastic support of our business stakeholders. This formed the basis of future work and exceeded expectations in virtually every dimension.⁴ Not only had we solved the worst pain of the legacy system we also unlocked increased velocity which allowed the company to move more quickly with net-new feature work, bringing value to our customers sooner and helping the business reach ever increasing levels of growth.

As leaders, it's tempting to think we need to act. Isn't it our job after all? Sometimes it is. But whether we need to jump in or not is derivative of a more fundamental directive.⁵

Sometimes the most powerful action we can take is no action at all.

My biggest concern was that the team would burn out on incidents. However, we shared incident work equally between both individual contributors and people managers, myself included. So while a real cost, it wasn't left to the team alone as management made such a decision. ↩
At least not if we haven't made that an explicit goal. This is in contrast to considering the broader systemic pieces and how they fit together in the context of failure. ↩
This is why single points of failure are often one of the first things engineers attempt to identify when planning for resilience. ↩
The team had been right about the root cause of instability all along. ↩
So much of leadership happens before any decision or action. It's easy to conflate the result with the process but they are distinct and separate. The artifacts of leadership, like the code of a program, are mere shadows cast off the work itself. ↩

Letting Things Go Sideways

Back Pressure

Heroism

Enjoying the article?

How I Learned To Stop Worrying

Footnotes

A Newsletter to Share My Knowledge