Taming Software Complexity

Complexity is everywhere in our world: In our ever-growing canon of laws, in the volatile & unpredictable nature of the stock markets (esp. now with the abundance of autonomous trading systems), in our tax codeand of course in the second law of thermodynamics. It's little wonder then, as programmers the world over know, that complexity is definitely present in our software. Of all the long term threats to applications, complexity is perhaps second most critical (the first being no longer meeting user needs :-).

(Complexity can be beautiful too. Source)

Unfortunately, complexity goes hand in hand with success: the more popular an application, the more demands that are placed on it, and the longer its "working life". Left to their own devices both factors will increase complexity significantly. Life for a mildly successful app is even worse, the low demand usually results in a never-ending "maintenance mode" where poor code stewardship often abounds.

Without ways to tame complexity, any evolving piece of software, no matter how successful, will eventually collapse under the load imposed to simply maintain it.

How is software complexity defined? Many techniques have been proposed, from simple approaches such as lines per method, statement counts, and methods per class, to more esoteric-sounding metrics like efferent coupling. One of the most prevalent metrics in use today is cyclomatic complexity, usually calculated at the method level and defined as the number of independent paths within each method. Many tools exist to calculate it, at RelayHealth we've had good success with NDepend.

Identifying areas of complexity in the code base is easy. The hard part is deciding what to do about them. Options abound...

Big Ball of Mud
The "Do Nothing" approach is always worth exploring and it typically results in Brian Foote's Big Ball of Mud. Foote wrote the paper as, however noble the aspirations of their developers, many systems eventually devolve into a big, complex mass. He also notes that such systems can sometimes have appealing properties as long as their working life is expected to be mercifully short. Fate often intervenes though and woe betide the programmers stuck maintaining a big ball of mud.

Creating a big ball of mud is easy, just add code and mountain dew :-)

Let's assume that you'd like to stay out of the mud. What other options are there?

Process
Some simple process changes can help fight complexity:
  • Analyze code metrics upon checkin and reject the new code if the files changed don't pass complexity targets (this will initially slow down development if  you impose it mid-flight but it will improve your code quality).
  • Allocate bandwidth for complexity bashing: reserve capacity such as 1 sprint every release, or a %age of total story points (e.g. 20% of all completed story points every month).
  • Temporal taming: Focus on different parts of the architecture over time, say a new area every month.
  • Something I've been wondering about: Are there processes that promote complexity? Or are some so time consuming that they prevent developers from addressing complexity?
  • Automation is a powerful tool. You can easily add exceptions to a manual process ("Oh, well if it's an app server in cluster B, then we need to run this additional program") but an automated process is a lot harder to complexify, and if it needs additional steps at least you'll know their execution will be consistent.

Architecture
Complexity has spawned many solutions at the architecture / software engineering levels, though even something as basic as ensuring developers all have a common understanding of the architecture and documenting its basic idioms can go far. Other solutions are very well covered in our industry:
  • Design patterns. Tried and true approaches to common problems.
  • Aspect Oriented Programming. AOP's focus on abstracting common behaviors from the code base can reduce its complexity.
  • Service Orientation. Ruthlessly breaking up your applications into disparate, independent services reduces the overall complexity of the system. This is an SOA approach but without the burdening standards and machinery that armchair architects are prone to impose. One of my favorite examples of this approach, Amazon.com, has been using SOA since before anyone thought up the acronym. By creating loosely coupled services with standard interfaces it's much easier to update or completely replace a service compared to the same work in an inevitably intertwined monolithic application.

Culture
The most powerful weapon against the encroachment of complexity is culture: the shared conviction among developers that everyone needs to pitch in to reduce it.
  • Refactoring: developers should feel empowered to refactor code that's overly complex, not in line with the evolution of the architecture, or simply way too ugly. Two key enablers are required and both need a strong cultural buy-in:
  • A solid set of unit and other tests so the developers knows if they've broken something
  • A fast build & test cycle. Most developers like to work in small increments. Make a small change, test it. If it takes 15min for a build & test cycle, very few developers are going to refactor anything that isn't directly in their path. I really like the work Etsy has done in this area as well as culture in general by focusing on developer happiness
  • Adopt this maxim: "Leave each class a little better than when you found it". Even if it's a small change - adding a comment, reformatting a few lines of code - taken in aggregate these changes really add up over time.
  • Remove features. I heard one of Instagram's founders state that they spend a good deal of time removing features as well as adding them. That was probably a slight exaggeration, but removing features can be very powerful in terms of fighting complexity: both directly (fewer lines of code == lower complexity, except with perl ;-), and indirectly as a signal to the team and your customers.
  • What have I missed? I haven't written about complexity as the database level though, while we're on the topic, I suspect that however much I like NOSQL databases, their rise will increase data complexity in the long term. The leeway many provide developers in storing information will make it very hard to manage it: data elements will be needlessly duplicated, inconsistencies within elements will abound, etc. Error recovery will be critical, as will a strong business layer to help provide consistency.

    (Another source of complexity! Source :-) 

    Happy simplifying!

    How eBay Scales its Platform

    In these days of discussing how Facebook, Twitter, Foursquare, Tumblr, and others scale, I don't often think of eBay anymore. Yet eBay, despite its age and ugly UI, is still one of the largest sites on the internet, esp. given its global nature. So I enjoyed this QCon talk by Randy Shoup, eBay Chief Engineer, about Best Practices for Large-Scale Websites.

    Here are few lessons that caught my eye:
    • Partition functions as well as data: eBay has 220 clusters of servers running different functions like bidding, search, etc. This is the same model Amazon and other use
    • Asynchrony everywhere: The only way to scale is to allow events to flow asynchronously throughout the app
    • Consistency isn't a yes/no issue: A few datastores require immediate consistency (Bids), most can handle eventual consistency (Search), a few can have no consistency (Preferences)
    • Automate everything and embed machine learning in your automation loops so the system improves on its own
    • Master-detail storage is done detail first, then master. If a transaction fails in the middle, eBay prefers having unreachable details than a partial master record. Reconciliation processes clean up orphaned detail records
    • Schema-wise, eBay is moving away from strict schemas towards key/value pairs and object storage
    • Transitional states are the norm. Most of the time eBay is running multiple versions of its systems in parallel, it's rare that all parts of a system are in sync. This means that backwards compatibility is essential
    • "It is fundamentally the consumer's responsibility to manage unavailability and quality-of-service violations." In other words: expect and prepare for failure
    • You have only one authoritative source of truth for each piece of data but many secondary sources, which are often preferable to the system of record itself
    • There is never enough data: Collect everything you never know you'll need. eBay processes 50TB of new data / day and analyzes 50PB of data / day. Predictions in the long tail require massive amounts of data

    Platforms as a Service Revisited

    In September 2007, Marc Andreessen wrote a thought provoking blog post describing a way to categorize different types of Platforms as a Service (PaaS). Over the years we’ve often made use of Andreessen’s levels within our Engineering team as a convenient way to discuss how we want our own platform to evolve.

    So I was surprised when I went looking for that article the other day and it was nowhere to be found on Andreessen’s blog! Fortunately I was able to rescue it from oblivion thanks to the internet archive. I’ve also uploaded a PDF of the article to Scribd.

    Andreessen’s premise is that there are three levels of internet platforms (the term Platform as a Service didn’t exist back then):
    1. At level 1 a platform is essentially a series of APIs in the cloud (another term that had yet to make its appearance) that app developers can leverage in their own apps.
    2. The prime example of a level 2 platform is Facebook. In addition to the APIs it makes available, Facebook also gives developers a way to embed their apps into its user interface.
    3. A level 3 platform achieves something the other two levels can’t: It runs developers’ apps for them. Examples here include Force.com (Salesforce.com’s PaaS) and Andreessen’s own Ning.

    As the levels go up they become easier for developers to program on and manage. A company working on a level 3 platform shouldn’t need to worry about hardware and operating systems. The categories aren’t perfect though: Amazon’s Web Services offerings are clearly level 3 (they run your code) while forcing you to still manage a virtual infrastructure.

    That said, many platforms do fit the model well. Thousands of companies offer APIs and therefore qualify as level 1. Platforms such as Google App Engine, Microsoft Azure, Heroku, EngineYard, etc. all offer flavors of level 3, some “purer” than others, I.e. with more or less hardware/OS abstraction.

    At RelayHealth we put a number of APIs at our partners’ and clients’ disposal. Some are web services, others rely on healthcare specific protocols such as HL7 or CCD riding on top of a variety of communication channels.

    Our approach to level 2 turns Andreessen’s definition inside out: instead of embedding third party apps into our UI, we make it easy for them to customize and embed our modules into their applications. This is important to many partners as building features like ePrescribing themselves is prohibitive. By providing these capabilities we enable our partners not only to deliver key features to their customers but also complete their EHR certifications (vital so their clients qualify for federal incentives).

    Regardless of your approach, if you’re building a whole platform or some simple APIs, Andreessen’s article is worth reading.

    Software Architecture should be forged in Fire, not Carved in Ice

    (Picture by Jason Bolonski)

    I've seen a number of corporate environments that carve software architectures out of ice.Why ice? Because an ice architecture sure looks great: sparkling, pristine, perfect even. This approach often feels right, esp. in cost conscious, slower moving organizations. You know how it goes: spend the bulk of your time in design, figure things out properly, measure twice (or thrice), cut once, and then you're set. Sadly once carved, often the only thing left to do with such an architecture is freeze it, lest it melt. And a frozen architecture is rarely useful.

    The problem is that you're never set. Needs keep changing and if your architecture can't evolve with them, you've (best case) got a working but unmaintainable and unevolvable app, or (worst case) something that becomes unusable, even by the people who need the application the most and who are willing to put up with its flaws.

    The architecture you want is not one carved in ice. Rather, you want something you've not only heated and beaten into shape to serve your current purpose, but also a design that you can reforge in to something new as needs dictate. So how do you achieve this?

    Understand your problem space & key challenges
    I'm not advocating no design, I'm advocating just enough design. Knowing how far to go is both art and science. Two things that will help is a good understanding of the problem space and the main obstacles you'll face. If you're building a social app you need to have at least broad designs for your sharing / trust model for users, how you will distribute data and scale, what security choices to make, etc. If you don't have strong expertise in house, hire someone who does, even as a part time consultant. It will be money well spent.

    Rapid Iteration
    Focus on speed. Not at the expense of quality but at the expense of features. Build your Minimum Viable Product, get it live, get it used, and iterate. The only way to really learn what works and what doesn't is to let your users at your app. The faster your iteration the more you can adapt (reforge) to changing needs. This is one of the reasons that agile development has become the de facto development approach in the past decade. Continuous deployment and delivery are other, welcome, instances of this trend.

    Less is more
    Build what you need and improve when needed. This goes hand in hand with rapid iteration and goes against the "what if" architecture. "What if I want to go live in other countries? Oh better internationalize", "What if we need to support suppliers as well as customers? Good point, better code them in". "What if I need to scale to hundreds of millions of users?" The list goes on and the longer you let it go the slower you'll be, not just in development but in maintenance and new feature additions.

    This doesn't mean making uneducated decisions. If you think there's a good chance you'll need to go international in future, leverage a framework that supports it. But don't build it internationalization (i18n) until you're ready to use it. I'd argue that the cost to support i18n - testing multiple languages across the app, different formats & interfaces, new & altered business logic, etc. - is not worth saddling your dev team with against the day when you finally need the feature.

    Ultimately, the fewer lines of code the better. To paraphrase Dijkstra: it's not lines of code produced, it's lines of code spent!. So spend wisely.

    Test Driven / Refactoring
    If there was one software engineering practice I'd enforce, this is it. A comprehensive and robust set of tests gives you the confidence to make radical structural changes to your application and still have a working product at the end of process. I've seen business leaders question the value of putting in effort here. Understandably they're concerned that all the time spent coding tests could have been better spent coding new features.

    A counter-argument to this is to remind the business folks that, once tests are written, all the downstream QA is not only free but extremely rapid. That's when you can reassign developers to new areas confident in the knowledge that you'll detect breaks long before they ever reach end users. As your codebase grows these tests will be a godsend to help avoid spending all your time on maintenance.

    "Less is more" is your friend here too: focus on DRY (Don't Repeat Yourself). The DRYer your code, the less of it you have, so the fewer tests you'll need and the easier it will be to reforge.

    Culture
    The most fundamental and so the most important. The culture of an organization is represented by people's shared values, goals, and behavior. Whether implicit or explicit, culture is the bedrock on which all else rests. The more individuals align with the culture, the more effective the team. This buy-in means that a successful culture cannot be dictated (typically by management), it must be nurtured.

    Culture will obviously vary by company but these common values will support all the principles listed above:
    • Continuous improvement: Continually striving to make things better, to achieve ever higher quality, and redefine goals as necessary 
    • Trust: Despite the best laid plans, failures happen. When they do, an organization needs to display enough trust in individuals and team to allow them to fix the problem, learn from the experience, and come out stronger
    • Collaboration: In my experience, tech and business are often in push-pull. Tech rarely gets the necessary time or resources, and business rarely gets all its desired features. The principles above drive long term value. If the culture prizes collaboration, openess, and sustained value, then neither group will want to sacrifice the short term for the long term

    Update (2011.2.3): Great article on the importance of culture from Wealthfront.

     

      What about...?
      What about separation of concerns, SOA, AOP, and more? All of these have their place and can certainly improve your application and architecture. Design patterns, if properly used, can make your software design more flexible and reduce the amount of refactoring needed to add new features. Still, these practices are should haves, not must haves. They can make good architecture great, but on their own won't keep it great in the long run.

      Ultimately the key to building great software is great people. Finding them, building the right culture together, and continuously evolving the organization and its processes. Over the years, new software engineering practices will come to light but these principles will evolve much more slowly.