The following article, written by myself and my colleague, Matt Simons, was published in the December 2010 issue of the Cutter IT Journal and is re-produced here with kind permission. It was also the subject of a talk we delivered in Santa Clara.
The landscape is changing
Since the dawn of the software era, systems have generally followed a lifecycle of develop/operate/replace. For the type of systems our company, ThoughtWorks, specializes in (typically built over the past 10-15 years), organizations expect as much as 5-10 years between significant investments in modernization. And some of the oldest core systems have now reached 40+ years – far longer than the average life-span of most companies today!
IT assets are relatively long-lived largely because modernization often represents a significant investment that doesn’t deliver new business value in a form that is very visible to managers or customers. Therefore organizations put off that investment until the case for change becomes overwhelming. Instead, they extend and modify their increasingly creaky platforms by adding features and making updates to (more or less) meet business needs.
For decades, this tension between investing in modernization versus making incremental enhancements has played out across technology-enabled businesses. Every year some companies take the plunge and modernize a core system or two, while others opt to put yet another layer of lipstick on the pig.
We see this pattern being disrupted as the demands being placed on legacy systems undergo a fundamental shift. Previously, system changes were driven by business requests for incremental features, and IT had to deal with a major new technology platform or architecture only every five years or so. Today, the viable lifespan of any business model is shrinking, driving demand for wholesale feature changes on a nearly continuous basis. For many of these new features, the benefits of leveraging one of the ever-expanding varieties of new architectures and platforms are significant. For example, your current infrastructure was probably not designed to enable you to connect your supply chain directly to an e-commerce channel or provide customer self-service via mobile devices. Sure, you may be able to force that square peg into your round hole, but the chances of an elegant, extensible solution are slim.
The “lifecycle model” of software systems is becoming irrelevant. The companies that will excel in the future will be the ones that learn to incrementally modernize and then continuously evolve their core technology assets to thrive in an ever-more volatile business and technology environment.
The first step is admitting you have a problem
Organizations that are constrained by creaky platforms are often slow to identify this as the root cause of their trouble. Instead, as release cycles grow longer and delivery quality declines, fingers get pointed at IT or at product/service vendors who are getting bogged down trying to work around the underlying problems caused by mounting technical debt in a toxic systems environment.
Figure 1 illustrates some key indicators of an underlying creaky platform, grouped according to whether they are felt more acutely by business or IT stakeholders and by whether they are intuitive or measurable. As you look through the factors, keep in mind that none of these should be considered “normal” or something that “comes with the territory” in IT. In fact, there are organizations out there operating in complex and volatile enterprise environments that do not experience any of these problems.
Figure 1 – Signs of a creaky platform.
We have often found that the intuitive factors manifest themselves in advance of the measurable factors. These can therefore be considered leading indicators that, should they appear, signal you to dig a bit deeper into your underlying technology infrastructure.
When evaluating a complex situation, organizations typically respond well to logical, fact-based arguments supported by quantitative data. So, as you consider whether you need to make a case to modernize, you should apply some measurements to areas of intuitive pain. There are different techniques for quantifying business and IT pain.
Quantifying business pain
Many of the business pain items in Figure 1 are straightforward to measure and don’t require special techniques. Simple counts of bugs, feature backlogs, and release frequencies help quantify the business impact of an underlying creaky platform. These items are especially effective when presented as trends over time versus point-in-time readouts.
However, one common type of business pain – cumbersome business processes – benefits from a more structured quantitative approach. There are many ways to do this, but one of the best is a technique from lean thinking called value stream analysis.1 This approach analyzes customer value-producing processes and identifies areas of waste. Measuring waste provides a powerful quantification of the business impact of cumbersome processes.
Quantifying IT pain
Technical debt is the phrase used to summarize the pain caused by the cumulative effect of all the tactical, short-term, “band-aid” solutions we use on IT projects. Just as in life, where it often makes sense to take on debt, at times it makes sense to go into technical debt in order to reach certain delivery milestones. Organizations that take a conscious decision to increase technical debt are usually sophisticated enough to plan to pay it down over time once the short-term objective has been reached. The problem is that technical debt is often built up unconsciously and, like a runaway credit card account, reaches a point where it becomes very difficult to bring the balance down.
Organizations often unwittingly amass technical debt because the major metrics managers have access to and are measured against are scope and budget-based and rarely include intrinsic quality metrics. The challenge is finding ways to balance these velocity-based metrics with quality metrics that can highlight the hidden cost of the continual tradeoffs that are made.
More objective measures of software quality do exist and can be used to track and control technical debt. Individually, none gives the whole picture, but together they begin to tell a cogent story. These metrics fall into three broad categories:
- Static code analysis (complexity, duplication, cohesion, tangling, etc.),
- Rule violations/programming errors (which can be identified using Checkstyle, FindBugs, PMD, etc
- Test quality (coverage)
Later in this article, we highlight some techniques for using metrics to drive remediation efforts, but when attempting to quantify technical debt, we have found it helpful to compare metrics against a set of well-known applications. Benchmarking a bundle of metrics against a few open source and internal projects of varying, but known, quality provides a clear comparative view of an application’s quality.
Sounding the call to action
Having decided that a system needs to be modernized, you need to get money to do it. With many priorities competing for funding, spending on legacy systems can be a tough sell. In our experience, timing is everything, and specific situations or events can open the door to modernization:
- New leader. Often new executives look to make their mark with a major initiative, and businesses give them leeway to do so. New people are also not vested in relationships and decisions made before their time, giving them more freedom to consider alternatives.
- New rules. New regulations or standards create non-negotiable drivers to update legacy systems. Depending on the scale of change, this may present an opportunity to make a case to deliver the change on a modernized platform as opposed to modifying the existing one.
- Business crisis. Most businesses respond assertively to threats and crises. One of our customers enjoyed years of a near monopoly, despite a core creaky platform that caused them to consistently underdeliver against their product roadmap. The impetus for modernizing that platform came when a major customer
left for a competitor because it had lost faith in the roadmap.
- Opportunities lost. Losing prospective new customers because of the application’s defects and apparent age is a powerful motivator for modernizing a creaky platform.
- New strategy. Aligning modernization with a key business strategy is wise. For example, we worked with an organization that ran an ad-driven community Web site. When business leaders decided to sell that Web site as a reskinnable platform to multiple customers, the development team made a successful argument to invest in modernizing the site.
- Technology breakthrough. Sometimes a new development in technology changes the economics of remediation or creates a new source of return on that investment. Thinking through the applications of new technology to your remediation efforts is worth the time.
IT-business collaboration is critical
Too many organizations make a fundamental error by approaching modernization as an IT-only problem. In so doing, they miss an opportunity to create new business value. They also tend to prioritize the work from a technical perspective, which often results in a quite different approach and solution than one created collaboratively with business stakeholders.
The perils of the IT-only approach were brought home to us during a consulting engagement in which the IT department of an investment bank asked us to validate their modernization roadmap. The roadmap was a plan to replace 29 systems over almost five years, resulting in a cutting-edge IT infrastructure. One of the first things we did was to ask key business stakeholders if the roadmap was aligned with their priorities. We were shocked to discover they weren’t even aware of the initiative. They were very concerned that by proceeding without business input, IT was likely to just rebuild all the redundant and inappropriate systems the business was struggling with, warts and all.
This is an extreme case, but it happens more frequently than you might expect. Fortunately, this story has a happy ending. We were able to broker a conversation between IT and the business that resulted in a major rationalization of the application portfolio and delivered a leaner, better-performing system much more quickly than the initial roadmap. The most successful modernization efforts are jointly planned and executed to deliver against IT and business priorities, incrementally evolving toward a better state for all stakeholders.
Deciding how to proceed
Once you’ve got your funding, you are faced with a decision about how to proceed. The two primary dimensions to consider are refactoring versus rewriting and “big bang” versus incremental. A rewrite can recreate exact feature parity with the existing application, just implemented in a new technology, or it can include redesigning the functionality. Despite the added complexity in testing, we strongly recommend taking the opportunity to identify what functionality is still really needed by the business. The keys to success in this scenario are:
- Working in tight coordination with the business and end users to gain constant verification of fitness for purpose
- Working in very small increments that can be fully validated and vetted
- Keeping the existing application in place to retain all existing functionality in other areas
Refactoring vs. Rewriting
Our general advice is, where possible, to look first at incremental refactoring. Good development practices should always include a refactoring phase when each new feature is added to maintain a simple, elegant, and well-factored design.
We find refactoring is best performed incrementally. Executing a large-scale refactoring exercise in isolation from the main code line (i.e., on a separate branch) should be considered dangerous. The key to a successful refactoring effort is doing it hand in hand with your normal project or production support team, integrating as you go.
If an application has deteriorated to such an extent that refactoring efforts are too big or painful to countenance, then you are faced with a total rewrite. Again there is a choice between big bang and incremental approaches.
Big Bang vs. Incremental
Replacing an application in a big bang is rarely our recommended strategy. Attempting to create feature parity with the legacy application extends timelines to the point that requirements are likely to have changed significantly between design and final delivery. Without feedback from live usage, it is likely the new version won’t meet all the business needs. The risk of the final cutover is also large, since the new application has yet to be battle-tested in production and a full data migration will be required.
Our preferred approach is a phased, incremental strategy. Though this may seem counterintuitive given the extra effort required to work around the existing application, our experience has shown that this minor cost is heavily outweighed by the reduced risk of the migration, the fitness for purpose of the resulting application, and the decreased disruption posed by the overall process.
Many business justifications for replacing an application include claims that the new application will be more “extensible.” There is a false premise that extensibility comes from up-front design activities that define modules, extension points, XML configuration, and the like. Our firm belief is that the best way to end up with an extensible platform is to extend it as you go. If you build your application incrementally, there is a good chance that you will make it extensible, particularly if you put in place the practices and patterns required to extend an application continuously, such as automated testing and simple modular design. Incremental approaches tend to increase the likelihood that your application will support ongoing extension.
The real challenge, then, is effectively performing an incremental rewrite of an application. We have some recommended methods and advice on approach and coordination.
Using metrics and visualization to drive remediation
Technical debt, just like its financial cousin, has the nasty habit of compounding. If you don’t pay it down regularly, then the ultimate recourse is declaring bankruptcy and reaching for the rewrite. The problem with the code metrics tools we mentioned earlier is that they tend to provide too much information to drive actionable remediation decisions. I remember attempting to run Checkstyle across a Fortune 500 client’s code base, and the program core-dumped before completing! In contrast, correlation and visualization are two particularly useful techniques for obtaining a holistic overview of the health of a system and also directing remediation activities.
When one client was struggling to make an impact on their technical debt, we helped them by correlating multiple metrics to direct their remediation activities on the highest-priority problem areas. Our premise was that if an area was complex but rarely touched, then it was less dangerous than one that was under heavy development. Likewise, a complex area covered with good automated testing is less critical than a similar one with no test coverage. Following this thinking, we created an aggregated risk metric that correlates complexity, test coverage, and volatility. Volatility was defined as a function of source control commit activity on the area of code – frequent activity indicated high volatility. This definition allowed us to pinpoint a small set of high-risk areas to address first; in a haystack of millions of lines of code, it called out a few very specific places to begin refactoring and rationalizing.
Toxicity, another aggregated metric, has been playing a prominent role in our “system health checks” at ThoughtWorks. Toxicity charts stack multiple static analysis metrics for classes, methods, or components within an application, providing a combined “toxicity” score for each area of the code base (see Figure 2). This gives our clients guidance on where to start looking to fix problems.
Figure 2 – Toxicity chart
Using visualization in this way allows us to avoid drowning in a sea of data. The human visual cortex is much more efficient at complex pattern recognition than most programs we could write, so it makes sense to leverage that capability.
The basic stacked bar chart used for toxicity is a good start, but if you want to correlate multiple variables, then tree maps are a powerful tool in that their combination of size, location, and color allows you to overlay more complex information onto a single image (see Figure 3). The nested nature of the visualization maps well onto the hierarchical nature of most code bases; color and size are then used to aggregate other metrics such as lines of code, complexity, or coverage. Again, this provides a single-shot overview of health as well as forensic information on where to look for the smoking gun.
Figure 3 – Treemap of code complexity
Tree maps show how the code is organized into packages and classes and visualizes their relative sizes. The size of the various rectangles represents the size of the class files and the encompassing packages. Coloring is then used to layer on an extra metric of interest – in this case, complexity.
Taking this technique one step further, a three-dimensional visualization called a “code city“ gives a real feel for the personality of the different neighborhoods in a code base (see Figure 4). This view of an application as a city supports the analogy that you need to pair program when the code base is so dangerous that you’re afraid to go in alone. A code city visualization is similar to a tree map (see above) but uses the third dimension to overlay an extra metric to correlate. An ideal combination is correlating complexity to test coverage. So, if the visualization maps lines of code to area, complexity to height, and test coverage to color, then a neighborhood containing large, tall, red buildings would represent an area of the code base that contains large, highly complex, and untested classes. Clearly this would be an area in which you would want to proceed with extreme caution.
Figure 4 – CodeCity visualization
As helpful as they are, automated metrics are only part of the story in avoiding technical debt. Code has to communicate effectively with both computer and human audiences. Automated functional and unit tests can tell you how well the code communicates with computers by verifying the expected behavior. However, automated metrics can only hint at how well the code communicates with humans. Human involvement is ultimately needed in the evaluation of any code base. We prefer doing this in real time through pair programming, though code reviews and other techniques can provide similar benefits. Remember: metrics should be the beginning not the end of the conversation.
How to replace your legacy application
Before embarking on replacing an application, it is worth taking stock of your existing organization. Conway’s Law states that the architecture of an application will come to mirror the communication patterns of the organization that created it. This can be summarized as “Dysfunctional organizations tend to create dysfunctional applications.” To paraphrase Einstein, you can’t fix a problem from within the same mindset that created it, so it is often worth investigating whether restructuring your organization or team would prevent the new application from displaying all the same structural dysfunctions as the original. In what could be termed an “inverse Conway maneuver,” you may want to begin by breaking down silos that constrain the team’s ability to collaborate effectively. Of course there are many situations where this may not be realistic, but remember that Conway’s Law talks of the “communication structures” of an organization rather than reporting structures. There are often opportunities to improve communication pathways in lightweight ways without having to grapple with thornier organizational issues.
A set of recurring themes emerges from teams that have successfully executed incremental application rewrites:
- Working on the main code line (or trunk) is vital to avoiding painful merges or missing important improvements occurring in the underlying application.
- Having the team (at least partially) populated with people who have lived with the pain of the existing application and have a deep understanding of the subtleties of the business and technology domain is key in shaping the new product to meet the business’s needs.
- Creating a small, colocated team is also recommended, as is having a clear and focused charter of the business need that each phase of the project is delivering.
- Practicing and executing data migrations is best done from the outset of the project while it is still a tractable problem.
A favored approach for incrementally replacing a live application is the so-called “strangler application” named after the family of tropical strangler figs. These plants grow quickly around an existing tree, using its existing structure for support and shape. Over time they thicken and fuse, completely surrounding and replacing the original tree, leaving a new version standing in its place as the old one withers and dies. The strangler application uses the same approach of creating a thin wrapper around an existing application, then gradually peeling or slicing off and replacing functionality. Features are gradually migrated from the legacy to the new application until nothing of value remains in the old, enabling a graceful retirement.
Common patterns include:
- Intercepting requests at the front of an application, then redirecting certain ones to the legacy application and others to the new application
- Sharing a single integration database or regularly trawling the old database to populate the new one
- Peeling back the application vertically tier by tier (maybe replacing the database, presentation layer, or business logic first)
- Intercepting events
- Some combination of the above
Figure 5 – A basic “strangler application” sequence
Most of these patterns involve creating an extra piece of infrastructure, such as an indirection layer or a polling system, which may appear to be wasted effort. Interestingly, we regularly find that the strategies required to enable a gradual migration often prove to be valuable long-term architectural buffering devices that both provide resilience during system upgrades or outages and offer the seams needed for future enhancements. An example of this was when we provided a major ISP a mechanism to replace their application stack incrementally by separating the system that captures new client orders from the system that processes them. Later on this separation proved invaluable in enabling them to continue taking orders even while their main order processing system was down for maintenance.
Breaking the cycle of pain
The recommendations here provide suggestions for how to modernize, but that is only half the battle. To really elevate your enterprise to the next level, you need to break the modernization-degradation cycle permanently. Fortunately, the tools and approaches that help you incrementally modernize are exactly the same ones that can eliminate the need to ever do it again:
- By using sophisticated automated metrics to continuously identify degrading areas of your system, you can strike a better balance between incremental remediation and adding new features going forward.
- By practicing an incremental, evolutionary approach to architecture (called a “strangler” in the context of remediation efforts, but just “evolutionary architecture” outside that context), you can avoid the need to embark on big-bang replacements.
- By continuing to add tests at the same rate as you add new functionality, you can create a hygienic technical environment that gives you the confidence to make significant architecture or technology changes when a new business requirement necessitates them.
Organizations that continue to think “system lifecycle” will, in the long run, lose ground to those that think “system evolution.” If you’re about to invest in a modernization effort, why not do so in a way that positions you for a fundamentally different future?