5 Self-Healing Patterns Important for Distributed Systems

Mainak Saha
5 min read · Feb 7, 2021

Self-healing systems are in high demand in today’s era of digital transformation. Proactive monitoring and self-healing are two key ways to increase the availability of any financial system.

With the right design considerations, self-healing can be achieved. It matters because industries are looking for more hands-off solutions: if the system goes down, you don’t need the support or dev team’s intervention to make it available again.

A few fundamental criteria for self-healing systems are:

  • It can recover on its own from an outage.
  • If one part of the system fails, the healthy parts keep running.
  • Resources for faulty parts of the system are scaled down automatically to avoid wasting compute.
  • Under high load, it can keep critical functionality running and, if required, de-prioritize or take offline non-critical functionality.

1. Retry Pattern

This is the most common way to re-establish a failed connection to another service. It is a widespread pattern and easy to implement, yet it is often missed or neglected, and we end up restarting the service or application just to get it reconnected.

Something from History ….

While the retry pattern is a must for a self-healing system, an unwise implementation can bring down the whole system by causing resource starvation. For example, a service running on an application server stopped responding, and client code with retry logic started getting HTTP 429 (Too Many Requests). With multiple clients all retrying at once, the request pool on the app server was starved.

This leads to a refinement known as “retry with exponential back-off”: instead of waiting no time, or a fixed amount of time, between retries, you increase the wait after each failure. Correctly implemented, this really can be a savior on a rough day.
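Here is a minimal Java sketch of retry with exponential back-off plus jitter. The callWithRetry helper, the attempt count, the delays, and the fetchQuote stand-in are illustrative assumptions, not details from the incident above:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class RetryWithBackoff {

    // Retries the supplied call, doubling the wait after every failure and adding jitter
    // so that many clients do not retry in lock-step against the same struggling server.
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long baseDelayMs)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e;                                          // give up after the last attempt
                }
                long delay = baseDelayMs * (1L << (attempt - 1));     // 200ms, 400ms, 800ms, ...
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs);
                Thread.sleep(delay + jitter);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // fetchQuote is a stand-in for the real remote call; it fails occasionally
        // purely to exercise the retry loop.
        System.out.println(callWithRetry(RetryWithBackoff::fetchQuote, 5, 200));
    }

    static String fetchQuote() {
        if (ThreadLocalRandom.current().nextInt(4) == 0) {   // fails roughly one call in four
            throw new RuntimeException("503 from upstream");
        }
        return "quote";
    }
}
```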

2. Circuit Breaker

I think this pattern is asked about in interviews more often than it is truly implemented. The Circuit Breaker and Retry patterns balance each other out: retry keeps attempting the call, while the circuit breaker decides when to stop attempting it.

Threadpool starvation without Circuit Breaker

Portfolio Viewer and Calculator are parts of a distributed system; if Viewer is built with a robust retry mechanism, the circuit breaker should be properly implemented to save the Calculator service from resource starvation. For example, for every request from Viewer, Portfolio Calculator calls two external APIs in parallel, Market Price and Positions, and the orchestrator holds the requesting thread until it merges the two responses and builds its own response. Now, if one fine day the Market Price API starts taking a long time to respond, there will be a sudden spike of incoming request threads, all waiting for a response, which leads to thread pool starvation on the app server. Soon this starts impacting other applications running on that server.

Circuit Breaker in Action

To handle this scenario, a Circuit Breaker implementation is important. Once the system detects repeated failures, it should flip the circuit breaker to the Open state and start rejecting incoming requests with a proper error code, before wasting resources on calls to the external API. Internally it keeps probing; once the API starts responding again, it flips the circuit breaker back to the Closed state.
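A simplified Java sketch of that Open/Closed behavior is below; the class shape, threshold, and cool-down are assumptions for illustration, and in production you would more likely reach for a library such as Resilience4j rather than roll your own:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class CircuitBreaker {

    enum State { CLOSED, OPEN }

    private final int failureThreshold;    // consecutive failures before the circuit opens
    private final Duration coolDown;       // how long to reject requests before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    synchronized <T> T call(Supplier<T> action) {
        if (state == State.OPEN
                && Duration.between(openedAt, Instant.now()).compareTo(coolDown) < 0) {
            // Fail fast: reject immediately instead of tying up a thread on a dying dependency.
            throw new IllegalStateException("circuit open - request rejected");
        }
        try {
            T result = action.get();      // after the cool-down this is the trial call
            consecutiveFailures = 0;
            state = State.CLOSED;         // success flips the breaker back to Closed
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                state = State.OPEN;       // too many failures in a row: open the circuit
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```

In the Portfolio example, the orchestrator could wrap each Market Price call in something like breaker.call(() -> marketPriceClient.get(symbol)), where marketPriceClient is a hypothetical client, so a slow API trips the breaker instead of exhausting the thread pool.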

3. Load Leveling

Load leveling is commonly used to protect internal systems and databases from the wrath of a huge incoming load.

Service getting crushed

A load balancer is the common option, but Queue-Based Load Leveling is also used in many places. With a queue-based processing model, resource use can never exceed a limit at any point in time; it is always controlled by the number of consumers reading from the queue. During peak load, requests queue up on your messaging platform but never flood your internal services or databases into resource starvation.
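A rough Java sketch of that queue-based model follows, using a bounded in-memory queue as a stand-in for the messaging platform; the queue size, worker count, and order names are arbitrary assumptions:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueLoadLeveling {

    public static void main(String[] args) throws InterruptedException {
        // Bounded queue standing in for the messaging platform: bursts pile up here,
        // never in the downstream service or database.
        BlockingQueue<String> orders = new ArrayBlockingQueue<>(1_000);

        int workerCount = 4;   // the fixed number of consumers caps downstream resource use
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        process(orders.take());   // blocks while the queue is empty
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // A producer burst: offer() returns false instead of flooding anything when full,
        // so the caller can shed, retry, or push back.
        for (int i = 0; i < 5_000; i++) {
            if (!orders.offer("order-" + i)) {
                System.out.println("queue full, shedding order-" + i);
            }
        }
        Thread.sleep(2_000);   // demo only: give the daemon workers a moment before the JVM exits
    }

    static void process(String order) {
        // Placeholder for the real service or database call.
        System.out.println("processed " + order);
    }
}
```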

This is commonly used in trading and order management platforms. On days when the market reacts to something like $GME, this saves your system.

4. Checkpoint

This is very important for batch-like, long-running applications. If something fails in the middle and processing stops, you must design the job so that it can restart from the point where it failed. In simple terms, if you are processing 1 million messages and the job fails after 999,998 of them, a restart should only process the last two messages.
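A minimal Java sketch of checkpointing a batch job to a local file, in the spirit of that example; the file name, message count, and commit interval are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointedBatch {

    public static void main(String[] args) throws IOException {
        Path checkpoint = Path.of("batch.checkpoint");   // where the last committed offset lives
        long total = 1_000_000;

        // On restart, resume from the last committed offset instead of reprocessing everything.
        long start = Files.exists(checkpoint)
                ? Long.parseLong(Files.readString(checkpoint).trim())
                : 0;

        for (long i = start; i < total; i++) {
            processMessage(i);                           // placeholder for the real per-message work
            if ((i + 1) % 10_000 == 0 || i == total - 1) {
                // Commit progress periodically; a crash re-does at most one commit interval.
                Files.writeString(checkpoint, Long.toString(i + 1));
            }
        }
    }

    static void processMessage(long offset) {
        // Real processing goes here.
    }
}
```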

5. Throttling Pattern

Throttling is significant if you expose your services through APIs. Most API managers provide throttling functionality out of the box. It is nothing but limiting the number of calls to each API.

E.g., take two APIs sharing the same resources: /user/userinfo and /user/changeaddress. During a usage surge, we have to make sure the userinfo API gets more of the resources, because if it fails there will be a bigger outage across the system; the changeaddress functionality, in contrast, can be downgraded to a lower priority. Through API managers, you can control the number of calls to these endpoints. A lower limit on changeaddress will save your internal resources, or buy you time to scale them up during a crisis. It will give some users trying to change their address a bad experience, which is far better than a complete outage.
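For illustration, here is a toy Java throttler with a fixed one-minute window per endpoint. The endpoints match the example above, but the limits, the allow method, and the window-reset behavior are assumptions; a real API manager handles this for you, usually with more robust algorithms such as token buckets:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleThrottler {

    // Per-endpoint limits: the critical read endpoint gets far more headroom than the
    // endpoint we are willing to degrade during a surge.
    private static final Map<String, Integer> LIMITS_PER_MINUTE = Map.of(
            "/user/userinfo", 1_000,
            "/user/changeaddress", 50);

    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    // Returns false when the endpoint has used up its budget for the current one-minute
    // window; the caller should answer with HTTP 429 in that case.
    boolean allow(String endpoint) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 60_000) {   // crude fixed window; reset is racy but fine for a sketch
            counters.clear();
            windowStart = now;
        }
        int limit = LIMITS_PER_MINUTE.getOrDefault(endpoint, 100);
        int used = counters.computeIfAbsent(endpoint, k -> new AtomicInteger()).incrementAndGet();
        return used <= limit;
    }
}
```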

If you believe your system has self-healing capabilities, you should test them. Injecting random outages or turning off instances will help you fight Murphy’s Law.

“If anything can go wrong, it will.”

If you are bold enough and your system supports it, try Chaos Monkey by Netflix. It is an extreme way to test your system for D-Day.
