The Second System Problem

Chris Szymansky
7 min read · May 22, 2016

In The Mythical Man-Month, Fred Brooks outlines the Second-System Effect, writing that the second system “is the most dangerous system a man ever designs.” This post looks at first systems, second systems, and how the trappings of second systems can be avoided.

The first system

The goal of an engineering team at a startup is to get a functional software product built as quickly as possible. The product typically needs to be “just good enough” quality-wise, while fitting into a shoestring budget and tight deadlines.

Over time, more features are added and corners are inevitably cut, but it’s all done in the name of working quickly and getting something in front of customers. The team changes and evolves, the guy who wrote [insert despised feature here] leaves the company, and the remaining team is tasked with maintaining and building on the first system, which is backed by an increasingly messy codebase and dated technology.

Even if the first system works well enough, the team has ideas for improvement and starts to have discussions about a successor, or second system, where some of the issues of the first system can be corrected.

The second system

Second system discussions typically involve two competing schools of thought: people who want to start over from scratch, and people who want to keep building on the first system.

“We need to start over from scratch!”
— An Engineer

Engineers get wide-eyed with excitement over the possibility of building a new system while managers get nauseous over the time and budget required.

A risk that comes with this approach is the company putting the brakes on development of the current system and “going dark” until the launch of the new system. Delivering value to customers should never be sacrificed, so take this approach at your own risk. Also, just because an ambitious engineer estimates that the full rebuild will take “only [insert low number of months] months”, the likelihood of it actually being completed in that time is…low.

Despite these risks, some teams have taken this approach. Basecamp’s launch of “New Basecamp” (Yep, from all the way back in 2012. They’ve since launched a 3.0.) was an example of freezing the original Basecamp and doubling down on the new product. Every team needs to consider their own capabilities and risk tolerance.

Some teams try to mitigate this risk by splitting into two groups with one working on “legacy” (first system) and one working on a “new” rebuild from the ground up (second system). This rarely works, as the best engineers want to work on the new system and no one (voluntarily) wants to work on the legacy system. Engineers stuck on “legacy” get dissatisfied and feel like they are limited in acquiring knowledge, as the team on “new” gets to learn some shiny new technology.

The team working on “new” has to catch up to the legacy team in terms of features, but “legacy” is also building new features, so the bar that the “new” team has to hit keeps rising.

Finally, if the “new” group succeeds, then the QA team has a monumental task on their hands to make sure defects are removed across the whole new system before it’s released to customers. The experience may be jarring for users and the support team will need plenty of caffeine to handle all of the complaints and bug reports (half of which will actually be misunderstood new features). New Basecamp’s rollout was not without its fair share of issues, primarily around items that were removed, like message privacy.

Once again, every team needs to consider their own capabilities and risk tolerance.

“We can always rebuild it later, just keep hacking on the current system!”
— A Manager

Managers are excited that the team can maintain the same feature development pace by hacking away at the current system, but the engineering team gets nauseous over the ever-growing pile of technical debt below the surface.

Holding steady on the current system indefinitely is likely to be unsustainable long term, so the engineering team’s leadership needs to clearly articulate to management the risks of not investing in a new system.

However, it’s also important not to use hyperbole and paint things as worse than they actually are. Data helps here: defect counts traceable to bad legacy code, or a pattern of delays because [insert despised feature here that a former employee wrote] needs to be modified every time a small change is made to the codebase. Evidence like that explains why the status quo may not be sustainable.

The compromise

It’s possible, and in my experience preferable, to continue to invest in the current system while iteratively building and shipping a new system. A second system can co-exist with a first system, but it takes substantial engineering and architectural planning, as well as sound product management, to have a fighting chance of pulling it off.

Step 1) Get all key people (technical and non-technical) in a room and explain the problem. Make it clear that an investment must be made in both the current system and a new system. The current system is what pays the bills, so it can’t be neglected, but if no investment is made in something new, the old system may eventually collapse like a house of cards. A technical debt metaphor typically resonates with technical and non-technical team members.

Step 2) Spend some time with a small group investigating and mapping out a future target architecture (tech stack, frameworks, code standards, etc.). Ask the question, “if we build a new system, what would we change or do differently?” This will likely be a grandiose vision. Sit on it for a week, come back to the table, and work to pare it down to the minimum viable architecture that sets you up for the future but isn’t ridiculously over-engineered.

Spend time discussing how both the current system and new system can co-exist in a production environment. Bonus points if users will ultimately be able to jump between features without realizing they are on a new technology. To an end user, there should be no “second system”. To them, it should just feel like an evolved version of what they are already using. Sometimes the best feature release is the one that no one notices, because it just works.
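One common way to make two systems feel like one product is a thin routing layer (the “strangler fig” pattern): requests for migrated features go to the new system, everything else still hits legacy. The sketch below is a hypothetical illustration, not a prescription; the backend URLs and feature names are invented, and in practice this logic usually lives in a reverse proxy or API gateway rather than application code.

```python
# Hypothetical routing layer: per-feature, decide which system serves a request.
# Users see one URL space; the split between "legacy" and "new" is invisible.

LEGACY_BACKEND = "https://legacy.internal.example.com"  # invented for the sketch
NEW_BACKEND = "https://new.internal.example.com"        # invented for the sketch

# Features migrate one at a time; everything not listed stays on legacy.
MIGRATED_FEATURES = {"reports", "notifications"}

def route(path: str) -> str:
    """Return the backend URL that should serve this request path."""
    feature = path.strip("/").split("/", 1)[0]
    backend = NEW_BACKEND if feature in MIGRATED_FEATURES else LEGACY_BACKEND
    return backend + path

print(route("/reports/monthly"))   # served by the new system
print(route("/billing/invoices"))  # still served by legacy
```

Because the routing table is the single source of truth, moving a feature to the new system is a one-line change, and rolling it back is just as cheap.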

Step 3) Explain to the engineering and product teams that there is a need to be pragmatic regarding the new system. Make it clear that the goal is not to rewrite every line of code from scratch, but rather to identify high-value new features and existing features that can be moved to the new system in an iterative manner. Decide on a feature-by-feature basis what will be built on the new architecture and when (some low-value features may never need to be moved). Account for the fact that features built on “new” will have some additional complexity as technologies are being learned and some course correction may be required, whereas features built on “legacy” accumulate more technical debt.

Use discretion when deciding what should initially be built on the new system. Find something valuable, but ancillary to core workflows in the product. The goal should be to avoid running into something with tons of dependencies.

The ultimate extreme of bad planning is getting into a scenario that requires that everything needs to be ported at once, which is a Really Bad Thing™, and greatly reduces the project’s chance at success. Having delays and large amounts of dependencies out of the gate will not make a good impression on management, so set the team up for success by choosing projects wisely.

Step 4) Build the first pilot feature on the new system. Choose a subset of the team that has different skill sets and experience levels. Don’t consume the entire team with the project. Avoid a team that is too junior (they may run into difficult technical issues) and also avoid a team that is too senior (this may create the impression that all of the senior engineers are working on the “new” system, and you still need technical leadership on other projects).

Step 5) Ship as soon as possible. The perfect piece of software still hasn’t shipped, so just focus on getting yours in front of customers.

Getting the first iteration of the new system into production quickly is essential. Every week and every month where things are being fine-tuned increases the odds of scope creep. Once live, closely monitor the usage and performance of the new system and be prepared to quickly tackle any major problems.
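One way to ship quickly while keeping risk contained is a percentage-based rollout gate: the pilot feature reaches a small, deterministic slice of users first, and the slice grows as monitoring builds confidence. This is a minimal sketch under invented names and a hypothetical 10% starting point; real deployments typically use a feature-flag service rather than hand-rolled code.

```python
# Hypothetical rollout gate: bucket each user deterministically, so the same
# user always gets the same system, and widen the bucket as confidence grows.
import hashlib

ROLLOUT_PERCENT = 10  # start small; raise this as monitoring looks healthy

def on_new_system(user_id: str) -> bool:
    """True if this user's traffic should be served by the new system."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < ROLLOUT_PERCENT
```

Hashing (rather than random assignment) matters: a user who saw the new system yesterday sees it again today, and rolling back is just lowering the percentage.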

Step 6) Choose the next feature up for buildout. Getting to this point is a huge accomplishment because it means that two systems are co-existing and Brooks’s Second-System Effect has been avoided or mitigated. Continue to repeat steps 3–5 with new or existing features. Rotate the team working on the project so that eventually every team member gets training and experience with the new system.

Training the team is crucial. Match people who have experience with the new system with people who do not. Think of it as a technical mentorship program until everyone has shipped code on the new system.

Gradually get more teams working on the new system in parallel. For example, if there are three subteams working on building features, maybe have one working on the new system and two working on the legacy system. Eventually, expand that so that two teams are on “new” and one is on “legacy”, and finally, get all three teams working on “new”.

Depending on the complexity of the project, it may take months before every subteam is working on the new system, and it may take months or years before the whole system is the new system. But that’s OK: an agile, iterative approach to delivery that is ingrained in the engineering, product, and management teams may mean that you never need a “third system”. And that’s the topic of another post.


Chris Szymansky

CTO at Fieldguide (https://fieldguide.io). Prev. engineering and product at Atrium and JazzHR.